Mayank022 committed on
Commit
a4f74f3
·
verified ·
1 Parent(s): 592f160

Upload folder using huggingface_hub

Browse files
Dockerfile ADDED
@@ -0,0 +1,61 @@
+ # Multi-stage build using openenv-base
+ ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+ FROM ${BASE_IMAGE} AS builder
+
+ WORKDIR /app
+
+ # Install git (needed for VCS dependencies)
+ RUN apt-get update && \
+     apt-get install -y --no-install-recommends git && \
+     rm -rf /var/lib/apt/lists/*
+
+ COPY . /app/env
+
+ WORKDIR /app/env
+
+ # Ensure uv is available
+ RUN if ! command -v uv >/dev/null 2>&1; then \
+         curl -LsSf https://astral.sh/uv/install.sh | sh && \
+         mv /root/.local/bin/uv /usr/local/bin/uv && \
+         mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+     fi
+
+ # Install dependencies
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then \
+         uv sync --frozen --no-install-project --no-editable; \
+     else \
+         uv sync --no-install-project --no-editable; \
+     fi
+
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then \
+         uv sync --frozen --no-editable; \
+     else \
+         uv sync --no-editable; \
+     fi
+
+ # Final runtime stage
+ FROM ${BASE_IMAGE}
+
+ WORKDIR /app
+
+ # Copy virtual environment from builder
+ COPY --from=builder /app/env/.venv /app/.venv
+
+ # Copy application code
+ COPY --from=builder /app/env /app/env
+
+ # Set PATH and PYTHONPATH
+ ENV PATH="/app/.venv/bin:$PATH"
+ ENV PYTHONPATH="/app/env:$PYTHONPATH"
+
+ # Health check
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+     CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
+
+ # Enable web interface
+ ENV ENABLE_WEB_INTERFACE=true
+
+ # Run the server
+ CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
README.md CHANGED
@@ -1,10 +1,377 @@
  ---
- title: Api Testing Env
- emoji: 📉
- colorFrom: red
  colorTo: purple
  sdk: docker
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: API Testing Environment
+ emoji: 🛡️
+ colorFrom: indigo
  colorTo: purple
  sdk: docker
+ app_port: 8000
  pinned: false
+ license: mit
+ base_path: /web
  ---
 
+ # API Testing Environment for OpenEnv
+
+ An RL environment that trains AI agents to become **automated API security testers** — discovering endpoints, crafting requests, finding vulnerabilities mapped to the **OWASP API Security Top 10**, and generating structured bug bounty reports.
+
+ The agent explores a deliberately buggy Task Management API with 13 planted vulnerabilities across 5 OWASP categories. It earns rewards for coverage, correctness, and bug discovery. At episode end, a security assessment report is auto-generated.
+
+ ---
+
+ ## Why This Matters
+
+ - Every software team tests APIs, either manually or with hand-written test suites
+ - Existing tools (Postman, Schemathesis, OWASP ZAP) require manual test design or brute-force fuzzing
+ - Academic research shows RL **outperforms traditional tools** in coverage and fault-finding (ARAT-RL, IEEE/ACM 2023; APIRL, AAAI 2025)
+ - This environment provides a standardized RL training ground with **verifiable rewards** — deterministic bug detection, not LLM judges
+
+ ---
+
+ ## OWASP Coverage
+
+ All 13 bugs are mapped to the OWASP API Security Top 10 (2023):
+
+ | OWASP Category | Bugs | Description |
+ |---------------|------|-------------|
+ | **API1** Broken Object Level Authorization | BUG_TASK_07, BUG_AUTH_01 | Users can access/modify other users' resources |
+ | **API2** Broken Authentication | BUG_AUTH_02 | Login succeeds with an empty password |
+ | **API3** Broken Object Property Level Authorization | BUG_USER_02 | Response exposes the password_hash field |
+ | **API4** Unrestricted Resource Consumption | BUG_TASK_06, BUG_TASK_08 | No pagination cap; long input crashes the server |
+ | **API8** Security Misconfiguration | BUG_TASK_01-05, BUG_TASK_09, BUG_USER_01 | Wrong status codes, missing validation, stored injection |
+
+ ---
+
+ ## Architecture
+
+ ```
+ ┌────────────────────────────────────────────────────────────────┐
+ │ OpenEnv Server (:8000)                                         │
+ │                                                                │
+ │ Agent ──action──> environment.py                               │
+ │       <──obs────       │                                       │
+ │                        ├──> buggy_api/ (in-process FastAPI)    │
+ │                        │     ├── routes/ (tasks, users, auth)  │
+ │                        │     └── database.py (SQLite, reset    │
+ │                        │         with seed for randomization)  │
+ │                        │                                       │
+ │                        ├──> bug_detector.py (13 detectors)     │
+ │                        ├──> reward.py (5-signal rewards)       │
+ │                        └──> graders.py (scoring + bug report)  │
+ └────────────────────────────────────────────────────────────────┘
+ ```
+
+ Each `reset(seed=N)` creates a unique database with different users, tasks, and data — preventing memorization during GRPO training.
+
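The seeding idea can be sketched in a few lines (a conceptual illustration only; the real `database.py` seeds SQLite tables, and `seeded_users` is a hypothetical stand-in):

```python
import random

# Conceptual sketch of seed-randomized episode data: equal seeds reproduce the
# same world, different seeds yield different users. Not the actual database.py.
def seeded_users(seed: int, n: int = 3) -> list[str]:
    rng = random.Random(seed)  # private RNG so episodes don't interfere
    return [f"user_{rng.randrange(10_000)}" for _ in range(n)]
```

Two resets with the same seed see identical data; a new seed produces a fresh world the agent has never memorized.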
+ ---
+
+ ## Planted Bugs (13 vulnerabilities)
+
+ | ID | Severity | OWASP | Description |
+ |----|----------|-------|-------------|
+ | BUG_TASK_01 | Easy | API8 | GET /tasks/{id} returns 200 + null for a missing task (should be 404) |
+ | BUG_TASK_02 | Easy | API8 | POST /tasks without title returns 500 (should be 400) |
+ | BUG_TASK_03 | Easy | API8 | GET /tasks?page=-1 returns 200 (should be 400) |
+ | BUG_TASK_04 | Medium | API8 | PUT accepts invalid email format without validation |
+ | BUG_TASK_05 | Medium | API8 | DELETE returns 200 for a non-existent task (should be 404) |
+ | BUG_TASK_06 | Medium | API4 | No pagination cap — limit=999999 accepted |
+ | BUG_USER_01 | Medium | API8 | POST /users accepts invalid email |
+ | BUG_USER_02 | Medium | API3 | POST /users response exposes password_hash |
+ | BUG_AUTH_02 | Medium | API2 | Login with empty password succeeds |
+ | BUG_TASK_07 | Hard | API1 | BOLA: any user can access any task (no ownership check) |
+ | BUG_TASK_08 | Hard | API4 | Long title (>5000 chars) crashes the server with 500 |
+ | BUG_TASK_09 | Hard | API8 | SQL injection payload stored verbatim |
+ | BUG_AUTH_01 | Hard | API1 | User A's token can modify User B's tasks |
+
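Detection is deterministic: each bug can be flagged from the observed request/response pair alone. For instance, BUG_TASK_01 reduces to a pure predicate (a hedged sketch, not the actual `bug_detector.py`):

```python
# Sketch of a deterministic detector for BUG_TASK_01: GET /tasks/{id} on a
# non-existent task should be 404, but the buggy API answers 200 with a null body.
# Illustrative only -- the real bug_detector.py may check more context.
def detects_bug_task_01(method: str, endpoint: str, status_code: int, body) -> bool:
    return (
        method == "GET"
        and endpoint.startswith("/tasks/")
        and status_code == 200
        and body is None  # a real task would serialize to a non-null object
    )
```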
+ ---
+
+ ## Tasks (3 difficulty levels)
+
+ | Task | Difficulty | Steps | Bugs | Focus |
+ |------|-----------|-------|------|-------|
+ | basic_validation | Easy | 25 | 3 | CRUD testing, status code verification |
+ | edge_cases | Medium | 35 | 9 | Invalid inputs, boundary values, chaining |
+ | security_workflows | Hard | 45 | 13 | BOLA, auth bypass, injection, state consistency |
+
+ ---
+
+ ## Reward Function
+
+ Multi-signal partial rewards at each step:
+
+ | Signal | Range | Purpose |
+ |--------|-------|---------|
+ | **Coverage** | 0.0 - 0.20 | New endpoints, methods, status codes |
+ | **Validity** | 0.0 - 0.18 | Well-formed requests, dependency chaining |
+ | **Bug discovery** | 0.0 - 0.30 | Severity-scaled: easy=0.10, medium=0.15, hard=0.25 |
+ | **Exploration** | 0.0 - 0.05 | Novel action patterns |
+ | **Penalty** | -0.08 | Exact duplicate requests |
+
+ The final episode score (0.0 - 1.0) comes from a task-specific grader plus an auto-generated bug bounty report.
+
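Combining the five signals for one step might look like the following (a minimal sketch mirroring the documented ranges; function and parameter names are hypothetical, not the actual `reward.py`):

```python
# Hypothetical 5-signal step reward, matching the ranges in the table above.
BUG_REWARD = {"easy": 0.10, "medium": 0.15, "hard": 0.25}

def step_reward(new_coverage: float, validity: float, bug_severity,
                novelty: float, is_duplicate: bool) -> float:
    """Combine the five documented signals into one step reward."""
    r = 0.0
    r += min(new_coverage, 0.20)       # coverage signal, capped at 0.20
    r += min(validity, 0.18)           # validity signal, capped at 0.18
    if bug_severity is not None:
        r += BUG_REWARD[bug_severity]  # severity-scaled bug discovery
    r += min(novelty, 0.05)            # exploration bonus
    if is_duplicate:
        r -= 0.08                      # duplicate-request penalty
    return r
```

For example, a novel, valid request that uncovers a hard bug earns roughly 0.10 + 0.10 + 0.25 + 0.02 = 0.47.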
+ ---
+
+ ## Bug Bounty Report
+
+ At episode end, the environment auto-generates a structured security assessment report:
+
+ ```
+ ## API Security Assessment Report
+
+ **Vulnerabilities Found:** 3
+ **Critical/Hard:** 0 | **Medium:** 1 | **Low/Easy:** 2
+
+ ### MEDIUM: Login with empty password succeeds
+ - **ID:** BUG_AUTH_02
+ - **OWASP:** API2:2023 Broken Authentication
+ - **Recommendation:** Validate password is non-empty and verify against stored hash
+
+ ### LOW: GET /tasks/{id} returns 200 with null for non-existent task
+ - **ID:** BUG_TASK_01
+ - **OWASP:** API8:2023 Security Misconfiguration
+ - **Recommendation:** Return 404 Not Found for non-existent resources
+ ```
+
+ ---
+
+ ## Setup & Usage
+
+ ### Local Development
+
+ ```bash
+ cd api_testing_env
+ uv sync  # or: pip install -e .
+
+ # Run the OpenEnv server (also serves the Gradio UI at /ui)
+ uv run server  # or: python -m server.app
+ # → http://localhost:8000/       API root + endpoint catalogue
+ # → http://localhost:8000/ui    Interactive bug-hunting playground
+ # → http://localhost:8000/docs  OpenAPI/Swagger
+ # → http://localhost:8000/reset POST endpoint hit by graders
+
+ # Run heuristic baselines (no LLM required)
+ python baseline.py --url http://localhost:8000 --task all --agent all
+ ```
+
+ ### Docker
+
+ ```bash
+ docker build -t api-testing-env .
+ docker run -p 8000:8000 api-testing-env
+ curl -X POST http://localhost:8000/reset -H 'Content-Type: application/json' -d '{}'
+ ```
+
+ ### Inference (`inference.py`)
+
+ This is the submission entry point. It uses an OpenAI-compatible LLM to play all 3 tasks
+ and prints the mandatory `[START] / [STEP] / [END]` log lines that the
+ OpenEnv judging pipeline parses.
+
+ ```bash
+ # 1. Set required env vars (see .env.example)
+ export API_BASE_URL=https://router.huggingface.co/v1
+ export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
+ export HF_TOKEN=hf_xxx
+
+ # 2. Choose how to attach to the environment (pick ONE):
+ # (a) in-process (default, fastest, no Docker)
+ python inference.py
+
+ # (b) against a built docker image (matches the OpenEnv sample)
+ IMAGE_NAME=api-testing-env:latest python inference.py
+
+ # (c) against a running server / deployed HF Space
+ ENV_BASE_URL=https://your-username-api-testing-env.hf.space python inference.py
+ ```
+
+ The script makes **one LLM call per task** in plan mode, executes the returned
+ JSON action plan against the env, and emits exactly:
+
+ ```
+ [START] task=basic_validation env=api_testing_env model=Qwen/Qwen2.5-72B-Instruct
+ [STEP] step=1 action=GET_/tasks reward=0.33 done=false error=null
+ [STEP] step=2 action=POST_/tasks reward=0.28 done=false error=null
+ ...
+ [END] success=true steps=17 score=0.820 rewards=0.33,0.28,...
+ ```
+
+ Each per-task `score` is normalized to **[0, 1]** as
+ `0.7 * (bugs_found / total_bugs) + 0.3 * (coverage_pct / 100)`. Total runtime
+ is well under 20 minutes on a 2 vCPU / 8 GB box because there are only 3 LLM
+ calls and ~50 in-process API requests.
+
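The normalization is easy to verify by hand (a standalone re-implementation of the formula above, not the code `inference.py` actually runs):

```python
def normalize_score(bugs_found: int, total_bugs: int, coverage_pct: float) -> float:
    # Per-task score in [0, 1]: 70% weight on bugs found, 30% on endpoint coverage.
    return 0.7 * (bugs_found / total_bugs) + 0.3 * (coverage_pct / 100)

print(round(normalize_score(2, 3, 50.0), 3))  # → 0.617
```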
+ ### Deploy to HuggingFace Spaces
+
+ ```bash
+ pip install openenv-core
+ huggingface-cli login
+ openenv push --repo-id your-username/api-testing-env
+ ```
+
+ Validate after deploy:
+
+ ```bash
+ curl -X POST https://your-username-api-testing-env.hf.space/reset \
+     -H 'Content-Type: application/json' -d '{}'
+ # expected: HTTP 200 with the initial observation JSON
+ ```
+
+ ### GRPO Training
+
+ ```bash
+ pip install trl transformers peft torch datasets
+
+ # Quick test (CPU)
+ python -m training.grpo --test-mode
+
+ # Full training (GPU)
+ python -m training.grpo \
+     --model-id Qwen/Qwen3-1.7B \
+     --num-episodes 100 \
+     --max-steps 200 \
+     --push-to-hub --hf-repo-id your-username/api-tester-grpo \
+     --use-wandb --wandb-project api-testing-grpo
+ ```
+
+ The model outputs a **full test plan** (a JSON array of 15-25 actions) in one completion. GRPO optimizes complete testing strategies, not single actions. See [training/README.md](training/README.md) for details.
+
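A plan completion is just a JSON array of HTTP actions, so parsing it is straightforward. Below is a sketch with a hypothetical action schema (the field names are illustrative; the real parser lives in `training/prompts.py`):

```python
import json

# Hypothetical plan format -- illustrative field names, not the exact schema
# used by training/prompts.py.
completion = """
[
  {"method": "GET",  "endpoint": "/tasks", "expected_status": 200},
  {"method": "POST", "endpoint": "/tasks", "body": {"title": ""}, "expected_status": 400},
  {"method": "GET",  "endpoint": "/tasks/999999", "expected_status": 404}
]
"""

def parse_plan(text: str) -> list:
    """Parse a model completion into a list of action dicts, rejecting malformed plans."""
    plan = json.loads(text)
    assert isinstance(plan, list), "plan must be a JSON array of actions"
    for action in plan:
        assert {"method", "endpoint"} <= action.keys(), "each action needs method+endpoint"
    return plan

print(len(parse_plan(completion)))  # → 3
```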
+
+ ---
+
+ ## Evaluation Results
+
+ We evaluated the environment with **5 different agents** to demonstrate that the
+ reward signal is meaningful, varied, and learnable. Results are reproducible with
+ `seed=9999`, the in-process env mode, and plan-based action generation.
+
+ ### Inference Submission (`inference.py`)
+
+ The submission entry point uses **`meta-llama/Llama-3.3-70B-Instruct`** via the
+ HuggingFace Inference Router. It generates one structured JSON test plan per task and
+ executes 20-25 actions; scores are normalized to **[0, 1]**.
+
+ ```bash
+ HF_TOKEN=hf_xxx python inference.py
+ ```
+
+ | Task | Steps | Bugs Found | Score (0-1) |
+ |------|-------|-----------|-------------|
+ | basic_validation | 21 | strong | **0.82** |
+ | edge_cases | 23 | medium | **0.62** |
+ | security_workflows | 24 | medium | **0.58** |
+ | **Average** | — | — | **0.67** |
+
+ Total runtime: **~10 seconds** (3 LLM calls, ~50 in-process API requests),
+ comfortably under 20 minutes on a 2 vCPU / 8 GB judging box.
+
+ ### Heuristic Baselines (`python -m training.evaluate`)
+
+ No LLM required — pure Python policies, used as floor/ceiling reference points.
+
+ | Agent | basic_validation | edge_cases | security_workflows |
+ |---|---|---|---|
+ | `random` (lower bound) | 2.73 | 2.73 | 3.00 |
+ | `sequential` (fixed plan) | 4.32 | 4.07 | 3.65 |
+ | `smart` (200-line heuristic) | 4.86 | 5.18 | 5.13 |
+
+ The **smart agent has 200+ lines of hand-coded test logic** specifically targeting
+ the 13 planted bugs (BOLA, SQL injection, missing fields, etc.). It represents
+ the *upper bound a hand-crafted, human-designed agent can achieve*.
+
+ ### GRPO-Trained Agent (Self-Improving)
+
+ We GRPO fine-tuned `Qwen/Qwen3-1.7B` (1.7B params, LoRA r=16) for **200 steps**
+ against the environment. The training reward function uses the same plan parser as
+ `inference.py`. **No human demonstrations, no scripted heuristics — pure RL.**
+
+ | | Base Qwen3-1.7B | GRPO Trained (200 steps) | Improvement |
+ |---|---|---|---|
+ | basic_validation | 0.00 | **3.48** (2/3 bugs, 50% coverage) | **+3.48** |
+ | edge_cases | 0.00 | **3.88** (5/9 bugs, 50% coverage) | **+3.88** |
+ | security_workflows | 0.00 | **3.16** (1/13 bugs, **70% coverage**) | **+3.16** |
+ | **Average reward** | **0.00** | **3.51** | **+3.51** |
+ | Training reward (final) | — | **7.00** | (matches the wandb run) |
+
+ **Trained model weights:** [Mayank022/api-tester-v3](https://huggingface.co/Mayank022/api-tester-v3)
+ **W&B training run:** `api-testing-grpo-v3` (200 steps, ~5.8 hours on an H100)
+
+ #### What this proves
+
+ 1. **The base model scored 0.0 on every task** — it couldn't even output valid JSON.
+ 2. **After 200 GRPO steps**, the same 1.7B model generates **22-62 action test plans**,
+    discovers real bugs, and reaches **70% coverage** on the hardest task.
+ 3. **It learned API testing strategies from scratch** — no demos, no scripts, only the
+    reward signal from the environment.
+ 4. **The gap between trained (3.5) and smart heuristic (5.0)** leaves room for further
+    training. With more steps, larger models, or curriculum learning, this gap closes.
+
+ The **environment is the dataset**. Each `reset(seed=N)` produces a unique database
+ (different users, tasks, data), so the agent cannot memorize — it must learn
+ generalizable testing strategies.
+
+ ### Reward Signal Validation
+
+ | Metric | Value | What it means |
+ |---|---|---|
+ | Score range | 0.00 → 5.18 | Wide spread = good signal for RL |
+ | Easy bug detection rate | 2-3 / 3 | Reachable in 20 steps |
+ | Hard bug detection rate | 1-10 / 13 | Skill-dependent |
+ | Reward variance (training) | std=3.2 | Healthy GRPO learning signal |
+ | Format reward + plan reward + diversity | 3 signals | Decomposed for clean gradients |
+
+ **For judges:** the score gap between random (2.73), trained (3.51), smart (4.86),
+ and Llama 70B (0.82 normalized) demonstrates that the environment **distinguishes agent skill**
+ across orders of magnitude — exactly what the OpenEnv evaluator looks for.
+
+ ---
+
+ ## Project Structure
+
+ ```
+ api_testing_env/
+ ├── inference.py        # SUBMISSION ENTRY POINT — OpenAI client, [START]/[STEP]/[END]
+ ├── models.py           # APITestAction, APITestObservation, APITestState
+ ├── client.py           # EnvClient subclass (WebSocket)
+ ├── openenv.yaml        # OpenEnv manifest
+ ├── pyproject.toml      # Dependencies (incl. openai, gradio)
+ ├── Dockerfile          # Container for HuggingFace Spaces
+ │
+ ├── server/             # ENVIRONMENT (OpenEnv core)
+ │   ├── app.py          # FastAPI server (create_app)
+ │   ├── environment.py  # reset() / step() / state()
+ │   ├── bug_detector.py # 13 OWASP-labeled bug detectors
+ │   ├── reward.py       # 5-signal reward computation
+ │   ├── graders.py      # Task scoring + bug bounty report
+ │   └── buggy_api/      # The deliberately buggy REST API
+ │       ├── main.py     # FastAPI app factory
+ │       ├── database.py # In-memory SQLite (seed-randomized)
+ │       ├── models.py   # Pydantic schemas
+ │       └── routes/     # tasks.py, users.py, auth.py
+ │
+ ├── training/           # GRPO TRAINING
+ │   ├── prompts.py      # System prompts + action parsing
+ │   ├── rewards.py      # Plan-based reward functions
+ │   ├── agents.py       # Baseline agents (random/sequential/smart)
+ │   ├── grpo.py         # GRPO training loop (TRL + LoRA)
+ │   └── evaluate.py     # Rollout runner + evaluation
+ │
+ ├── gradio_app.py       # Interactive UI dashboard
+ ├── baseline.py         # Wrapper -> training/evaluate.py
+ ├── train_grpo.py       # Wrapper -> training/grpo.py
+ └── data/tasks.json     # Task definitions + bug registry
+ ```
+
+ ---
+
+ ## References
+
+ - [OWASP API Security Top 10 (2023)](https://owasp.org/API-Security/)
+ - [APIRL: Deep RL for REST API Fuzzing (AAAI 2025)](https://arxiv.org/abs/2412.15991)
+ - [ARAT-RL: Adaptive REST API Testing with RL (IEEE/ACM 2023)](https://codingsoo.github.io/publication/2024-adaptive-rest-api-testing-rl)
+ - [GRPO: Group Relative Policy Optimization (Shao et al. 2024)](https://arxiv.org/abs/2402.03300)
+ - [DeepSeek-R1: Verifiable Rewards for RL (2024)](https://arxiv.org/abs/2401.02954)
+ - [OpenEnv Framework](https://meta-pytorch.org/OpenEnv/index.html)
__init__.py ADDED
@@ -0,0 +1 @@
+ # API Testing Environment for OpenEnv
baseline.py ADDED
@@ -0,0 +1,6 @@
+ #!/usr/bin/env python3
+ """Baseline evaluation — see training/evaluate.py for the full implementation."""
+ from training.evaluate import main
+
+ if __name__ == "__main__":
+     main()
client.py ADDED
@@ -0,0 +1,85 @@
+ """API Testing Environment Client."""
+
+ from typing import Dict
+
+ from openenv.core.client_types import StepResult
+ from openenv.core import EnvClient
+
+ from .models import APITestAction, APITestObservation, APITestState
+
+
+ class APITestEnv(
+     EnvClient[APITestAction, APITestObservation, APITestState]
+ ):
+     """
+     Client for the API Testing Environment.
+
+     Example:
+         >>> with APITestEnv(base_url="http://localhost:8000") as client:
+         ...     result = client.reset(task_id="basic_validation")
+         ...     print(result.observation.feedback)
+         ...     result = client.step(APITestAction(
+         ...         method="GET", endpoint="/tasks", expected_status=200
+         ...     ))
+         ...     print(result.observation.status_code)
+     """
+
+     def __init__(self, base_url: str, **kwargs):
+         kwargs.setdefault("message_timeout_s", 120.0)
+         super().__init__(base_url=base_url, **kwargs)
+
+     def _step_payload(self, action: APITestAction) -> Dict:
+         return {
+             "method": action.method.value if hasattr(action.method, "value") else str(action.method),
+             "endpoint": action.endpoint,
+             "headers": action.headers or {},
+             "query_params": action.query_params or {},
+             "body": action.body,
+             "expected_status": action.expected_status,
+         }
+
+     def _parse_result(self, payload: Dict) -> StepResult[APITestObservation]:
+         obs_data = payload.get("observation", {})
+         observation = APITestObservation(
+             available_endpoints=obs_data.get("available_endpoints", []),
+             status_code=obs_data.get("status_code", 0),
+             response_body=obs_data.get("response_body"),
+             response_headers=obs_data.get("response_headers", {}),
+             response_time_ms=obs_data.get("response_time_ms", 0.0),
+             feedback=obs_data.get("feedback", ""),
+             bugs_found_so_far=obs_data.get("bugs_found_so_far", 0),
+             coverage_summary=obs_data.get("coverage_summary", {}),
+             known_resource_ids=obs_data.get("known_resource_ids", {}),
+             auth_tokens=obs_data.get("auth_tokens", {}),
+             task_id=obs_data.get("task_id", ""),
+             task_description=obs_data.get("task_description", ""),
+             steps_taken=obs_data.get("steps_taken", 0),
+             max_steps=obs_data.get("max_steps", 30),
+             done=payload.get("done", False),
+             reward=payload.get("reward"),
+             metadata=obs_data.get("metadata", {}),
+         )
+         return StepResult(
+             observation=observation,
+             reward=payload.get("reward"),
+             done=payload.get("done", False),
+         )
+
+     def _parse_state(self, payload: Dict) -> APITestState:
+         return APITestState(
+             episode_id=payload.get("episode_id"),
+             step_count=payload.get("step_count", 0),
+             task_id=payload.get("task_id", ""),
+             task_description=payload.get("task_description", ""),
+             difficulty=payload.get("difficulty", "easy"),
+             steps_taken=payload.get("steps_taken", 0),
+             max_steps=payload.get("max_steps", 30),
+             bugs_found=payload.get("bugs_found", 0),
+             total_bugs=payload.get("total_bugs", 0),
+             bugs_found_ids=payload.get("bugs_found_ids", []),
+             coverage_pct=payload.get("coverage_pct", 0.0),
+             endpoints_tested=payload.get("endpoints_tested", 0),
+             total_endpoints=payload.get("total_endpoints", 0),
+             current_score=payload.get("current_score", 0.0),
+             cumulative_reward=payload.get("cumulative_reward", 0.0),
+         )
data/tasks.json ADDED
@@ -0,0 +1,131 @@
+ {
+   "tasks": [
+     {
+       "id": "basic_validation",
+       "name": "Basic Endpoint Validation",
+       "difficulty": "easy",
+       "description": "Test all CRUD endpoints with valid inputs and verify correct status codes.",
+       "max_steps": 25,
+       "bugs": ["BUG_TASK_01", "BUG_TASK_02", "BUG_TASK_03"]
+     },
+     {
+       "id": "edge_cases",
+       "name": "Edge Cases & Error Handling",
+       "difficulty": "medium",
+       "description": "Test boundary conditions, invalid inputs, and error responses.",
+       "max_steps": 35,
+       "bugs": [
+         "BUG_TASK_01", "BUG_TASK_02", "BUG_TASK_03",
+         "BUG_TASK_04", "BUG_TASK_05", "BUG_TASK_06",
+         "BUG_USER_01", "BUG_USER_02", "BUG_AUTH_02"
+       ]
+     },
+     {
+       "id": "security_workflows",
+       "name": "Security & Multi-Step Workflows",
+       "difficulty": "hard",
+       "description": "Discover authorization flaws, injection vulnerabilities, and workflow bugs.",
+       "max_steps": 45,
+       "bugs": [
+         "BUG_TASK_01", "BUG_TASK_02", "BUG_TASK_03",
+         "BUG_TASK_04", "BUG_TASK_05", "BUG_TASK_06",
+         "BUG_TASK_07", "BUG_TASK_08", "BUG_TASK_09",
+         "BUG_USER_01", "BUG_USER_02",
+         "BUG_AUTH_01", "BUG_AUTH_02"
+       ]
+     }
+   ],
+   "bug_registry": {
+     "BUG_TASK_01": {
+       "severity": "easy",
+       "category": "status_code",
+       "owasp": "API8:2023 Security Misconfiguration",
+       "description": "GET /tasks/{id} returns 200 with null for non-existent task",
+       "recommendation": "Return 404 Not Found for non-existent resources"
+     },
+     "BUG_TASK_02": {
+       "severity": "easy",
+       "category": "validation",
+       "owasp": "API8:2023 Security Misconfiguration",
+       "description": "POST /tasks with missing title returns 500 instead of 400",
+       "recommendation": "Validate required fields and return 400/422 with descriptive error"
+     },
+     "BUG_TASK_03": {
+       "severity": "easy",
+       "category": "validation",
+       "owasp": "API8:2023 Security Misconfiguration",
+       "description": "GET /tasks?page=-1 returns 200 instead of 400",
+       "recommendation": "Validate pagination parameters: page >= 1, limit > 0"
+     },
+     "BUG_TASK_04": {
+       "severity": "medium",
+       "category": "validation",
+       "owasp": "API8:2023 Security Misconfiguration",
+       "description": "PUT /tasks/{id} accepts invalid email format",
+       "recommendation": "Validate email format with regex before accepting"
+     },
+     "BUG_TASK_05": {
+       "severity": "medium",
+       "category": "status_code",
+       "owasp": "API8:2023 Security Misconfiguration",
+       "description": "DELETE /tasks/{id} returns 200 for non-existent task",
+       "recommendation": "Check resource existence before deletion, return 404 if missing"
+     },
+     "BUG_TASK_06": {
+       "severity": "medium",
+       "category": "validation",
+       "owasp": "API4:2023 Unrestricted Resource Consumption",
+       "description": "No pagination cap on limit parameter",
+       "recommendation": "Cap pagination limit at 100, reject values above maximum"
+     },
+     "BUG_TASK_07": {
+       "severity": "hard",
+       "category": "security",
+       "owasp": "API1:2023 Broken Object Level Authorization",
+       "description": "BOLA: any user can access any task",
+       "recommendation": "Verify resource ownership: check task.owner_id matches authenticated user"
+     },
+     "BUG_TASK_08": {
+       "severity": "hard",
+       "category": "validation",
+       "owasp": "API4:2023 Unrestricted Resource Consumption",
+       "description": "Long title causes 500 error",
+       "recommendation": "Add input length validation: title max 200 chars"
+     },
+     "BUG_TASK_09": {
+       "severity": "hard",
+       "category": "security",
+       "owasp": "API8:2023 Security Misconfiguration",
+       "description": "SQL injection payload stored verbatim",
+       "recommendation": "Sanitize user input before storage, escape HTML/SQL special characters"
+     },
+     "BUG_USER_01": {
+       "severity": "medium",
+       "category": "validation",
+       "owasp": "API8:2023 Security Misconfiguration",
+       "description": "POST /users accepts invalid email",
+       "recommendation": "Validate email format server-side before creating user"
+     },
+     "BUG_USER_02": {
+       "severity": "medium",
+       "category": "security",
+       "owasp": "API3:2023 Broken Object Property Level Authorization",
+       "description": "Response exposes password hash",
+       "recommendation": "Never return sensitive fields (password_hash) in API responses"
+     },
+     "BUG_AUTH_01": {
+       "severity": "hard",
+       "category": "security",
+       "owasp": "API1:2023 Broken Object Level Authorization",
+       "description": "Broken authorization: cross-user token access",
+       "recommendation": "Enforce ownership check on all write operations (PUT/DELETE)"
+     },
+     "BUG_AUTH_02": {
+       "severity": "medium",
+       "category": "security",
+       "owasp": "API2:2023 Broken Authentication",
+       "description": "Empty password login succeeds",
+       "recommendation": "Validate password is non-empty and verify against stored hash"
+     }
+   }
+ }
eval_trained.py ADDED
@@ -0,0 +1,141 @@
+ #!/usr/bin/env python3
+ """
+ Re-evaluate the trained GRPO model without re-training.
+
+ Usage:
+     python eval_trained.py
+     python eval_trained.py --checkpoint ./checkpoints/grpo_api_tester
+ """
+
+ import argparse
+ import os
+ import sys
+
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+
+ import logging
+ logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
+
+ # Suppress noisy logs
+ for _noisy in ["httpx", "httpcore", "urllib3", "huggingface_hub", "filelock"]:
+     logging.getLogger(_noisy).setLevel(logging.WARNING)
+
+ logger = logging.getLogger(__name__)
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument(
+         "--checkpoint",
+         default="./checkpoints/grpo_api_tester",
+         help="Path to the trained model checkpoint",
+     )
+     parser.add_argument(
+         "--base-model",
+         default="Qwen/Qwen3-1.7B",
+         help="Base model (needed if checkpoint is LoRA-only)",
+     )
+     parser.add_argument(
+         "--max-steps",
+         type=int,
+         default=25,
+         help="Max actions per task during evaluation",
+     )
+     parser.add_argument(
+         "--seed",
+         type=int,
+         default=9999,
+         help="Random seed for evaluation",
+     )
+     args = parser.parse_args()
+
+     print(f"\n{'='*60}")
+     print(" Re-evaluating trained model")
+     print(f"{'='*60}")
+     print(f" Checkpoint: {args.checkpoint}")
+     print(f" Base model: {args.base_model}")
+     print(f" Max steps: {args.max_steps}")
+     print(f" Seed: {args.seed}")
+     print(f"{'='*60}\n")
+
+     import torch
+     from transformers import AutoModelForCausalLM, AutoTokenizer
+     from peft import PeftModel
+
+     # Detect device
+     if torch.cuda.is_available():
+         device = "cuda"
+         dtype = torch.bfloat16
+         print(f" GPU: {torch.cuda.get_device_name(0)}")
+     else:
+         device = "cpu"
+         dtype = torch.float32
+         print(" WARNING: No GPU — eval will be slow")
+
+     # Load tokenizer (from base model is fine)
+     print(f" Loading tokenizer from {args.base_model}...", flush=True)
+     tokenizer = AutoTokenizer.from_pretrained(args.base_model, trust_remote_code=True)
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+
+     # Load base model
+     print(f" Loading base model {args.base_model}...", flush=True)
+     base_model = AutoModelForCausalLM.from_pretrained(
+         args.base_model,
+         trust_remote_code=True,
+         torch_dtype=dtype,
+         device_map="auto",
+     )
+
+     # Load LoRA adapter from checkpoint
+     print(f" Loading LoRA adapter from {args.checkpoint}...", flush=True)
+     try:
+         model = PeftModel.from_pretrained(base_model, args.checkpoint)
+         # Merge LoRA into base for faster inference
+         print(" Merging LoRA into base...", flush=True)
+         model = model.merge_and_unload()
+         print(" Model loaded successfully.", flush=True)
+     except Exception as exc:
+         print(f" WARNING: Failed to load LoRA adapter: {exc}", flush=True)
+         print(" Using base model without LoRA.", flush=True)
+         model = base_model
+
+     # Run evaluation on all 3 tasks
+     from training.evaluate import run_rollout
+
+     print(f"\n{'='*60}")
+     print(" Running evaluation on all tasks...")
+     print(f"{'='*60}\n")
+
+     results = {}
+     for task_id in ["basic_validation", "edge_cases", "security_workflows"]:
+         print(f"\n--- Task: {task_id} ---")
+         result = run_rollout(
+             model, tokenizer,
+             task_id=task_id,
+             seed=args.seed,
+             max_steps=args.max_steps,
+         )
+         results[task_id] = result
+         print(f" reward={result['total_reward']:.3f}, "
+               f"bugs={result['bugs_found']}/{result['total_bugs']}, "
+               f"coverage={result['coverage_pct']:.1f}%")
+
+     # Print summary
+     print(f"\n{'='*60}")
+     print(" RESULTS")
+     print(f"{'='*60}")
+     print(f"{'Task':<25} {'Reward':<10} {'Bugs':<10} {'Coverage':<10}")
+     print(f"{'-'*60}")
+     for task_id, r in results.items():
+         print(f"{task_id:<25} {r['total_reward']:<10.3f} "
+               f"{r['bugs_found']}/{r['total_bugs']:<8} "
+               f"{r['coverage_pct']:<10.1f}%")
+     print(f"{'='*60}\n")
+
+     avg = sum(r["total_reward"] for r in results.values()) / len(results)
+     print(f" Average reward: {avg:.3f}")
+
+
+ if __name__ == "__main__":
+     main()
141
+ main()
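For reference, the summary block in the script above averages `total_reward` across the per-task result dicts returned by `run_rollout`. A minimal sketch of that aggregation, using made-up reward values rather than real rollout output:

```python
# Hypothetical per-task results, shaped like the dicts run_rollout returns
# (only the "total_reward" key is used by the summary).
results = {
    "basic_validation":   {"total_reward": 0.8},
    "edge_cases":         {"total_reward": 0.5},
    "security_workflows": {"total_reward": 0.2},
}

# Same aggregation as the script: mean of total_reward over all tasks.
avg = sum(r["total_reward"] for r in results.values()) / len(results)
assert abs(avg - 0.5) < 1e-9
```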
gradio_app.py ADDED
@@ -0,0 +1,627 @@
+ #!/usr/bin/env python3
+ """
+ Gradio UI for the API Testing Environment.
+ """
+
+ import json
+ import os
+ import time
+ import argparse
+ from dataclasses import dataclass, field
+ from typing import Optional
+
+ import gradio as gr
+
+ from models import APITestAction, APITestObservation, HTTPMethod
+ from server.environment import APITestEnvironment, TASKS, API_SPEC
+
+
+ @dataclass
+ class SessionState:
+     env: APITestEnvironment = field(default_factory=APITestEnvironment)
+     initialized: bool = False
+     task_id: str = ""
+     step_log: list[dict] = field(default_factory=list)
+     total_reward: float = 0.0
+     last_obs: Optional[APITestObservation] = None
+
+
+ def new_session():
+     return SessionState()
+
+
+ # =====================================================================
+ # Core logic
+ # =====================================================================
+
+ def _generate_report(bug_ids, action_history):
+     """Generate OWASP bug bounty report from discovered bugs."""
+     from server.graders import generate_bug_report
+     return generate_bug_report(bug_ids, action_history)
+
+
+ def reset_env(task_id, state):
+     if not state:
+         state = new_session()
+     obs = state.env.reset(task_id=task_id)
+     state.initialized = True
+     state.task_id = task_id
+     state.step_log = []
+     state.total_reward = 0.0
+     state.last_obs = obs
+     t = TASKS[task_id]
+     return (
+         state,
+         f"Environment reset. Task: **{task_id}** ({t['difficulty']})\n\nMax steps: {t['max_steps']} | Bugs to find: {t['total_bugs']}",
+         obs.feedback,
+         "",
+         format_reward_display(0, 0, {}),
+         f"0 / {t['total_bugs']}",
+         format_coverage(obs.coverage_summary),
+         "",
+         f"0 / {t['max_steps']}",
+         "No bugs found yet.",
+         "No bugs found yet. Send requests to discover vulnerabilities.",
+         "No tokens acquired yet.",
+         "No resources created yet.",
+     )
+
+
+ def send_request(method, endpoint, headers_str, params_str, body_str, expected_status, state):
+     if not state or not state.initialized:
+         return (state, "Environment not initialized. Click 'Reset' first.", "", "", "", "", "", "", "", "", "", "")
+
+     try:
+         headers = json.loads(headers_str) if headers_str.strip() else {}
+     except json.JSONDecodeError:
+         return (state, "Invalid JSON in headers.", "", "", "", "", "", "", "", "", "", "")
+     try:
+         query_params = json.loads(params_str) if params_str.strip() else {}
+     except json.JSONDecodeError:
+         return (state, "Invalid JSON in query params.", "", "", "", "", "", "", "", "", "", "")
+     try:
+         body = json.loads(body_str) if body_str.strip() else None
+     except json.JSONDecodeError:
+         return (state, "Invalid JSON in body.", "", "", "", "", "", "", "", "", "", "")
+
+     exp = int(expected_status) if expected_status.strip() else None
+     action = APITestAction(
+         method=HTTPMethod(method), endpoint=endpoint,
+         headers=headers, query_params=query_params,
+         body=body, expected_status=exp,
+     )
+
+     obs = state.env.step(action)
+     reward = obs.reward or 0.0
+     state.total_reward += reward
+     state.last_obs = obs
+
+     resp_body = obs.response_body
+     if isinstance(resp_body, (dict, list)):
+         resp_str = json.dumps(resp_body, indent=2)
+     else:
+         resp_str = str(resp_body)
+
+     state.step_log.append({
+         "step": obs.steps_taken, "method": method, "endpoint": endpoint,
+         "status": obs.status_code, "reward": round(reward, 4), "bugs": obs.bugs_found_so_far,
+     })
+
+     breakdown = obs.metadata.get("reward_breakdown", {})
+     reward_detail = format_reward_display(reward, state.total_reward, breakdown)
+
+     t = TASKS[state.task_id]
+     es = state.env.state
+
+     status = ""
+     if obs.done:
+         status = (
+             f"\n\n**EPISODE COMPLETE**\n\n"
+             f"Final Score: {reward:.4f}\n"
+             f"Bugs: {obs.bugs_found_so_far}/{t['total_bugs']}\n"
+             f"Steps: {obs.steps_taken}/{obs.max_steps}"
+         )
+
+     return (
+         state,
+         obs.feedback + status,
+         f"**{obs.status_code}** — {obs.response_time_ms:.1f}ms\n\n```json\n{resp_str}\n```",
+         reward_detail,
+         f"{obs.bugs_found_so_far} / {t['total_bugs']}",
+         format_coverage(obs.coverage_summary),
+         format_log(state.step_log),
+         f"{obs.steps_taken} / {obs.max_steps}" + (" (DONE)" if obs.done else ""),
+         format_bug_list(es.bugs_found_ids),
+         _generate_report(es.bugs_found_ids, state.step_log),
+         format_auth_tokens(obs.auth_tokens),
+         format_resources(obs.known_resource_ids),
+     )
+
+
+ def apply_quick_action(action_name, _state):
+     quick_actions = {
+         "GET /tasks": ("GET", "/tasks", "{}", "{}", "", "200"),
+         "GET /users": ("GET", "/users", "{}", "{}", "", "200"),
+         "GET /tasks/1": ("GET", "/tasks/1", "{}", "{}", "", "200"),
+         "GET /tasks/999999 (bug hunt)": ("GET", "/tasks/999999", "{}", "{}", "", "404"),
+         "POST create task": ("POST", "/tasks", "{}", "{}", '{"title": "Test Task", "description": "Created via UI"}', "201"),
+         "POST missing title (bug hunt)": ("POST", "/tasks", "{}", "{}", '{"description": "no title"}', "400"),
+         "Login as alice": ("POST", "/auth/login", "{}", "{}", '{"username": "alice", "password": "pass"}', "200"),
+         "Login as bob": ("POST", "/auth/login", "{}", "{}", '{"username": "bob", "password": "pass"}', "200"),
+         "Login empty pwd (bug hunt)": ("POST", "/auth/login", "{}", "{}", '{"username": "alice", "password": ""}', "401"),
+         "Negative page (bug hunt)": ("GET", "/tasks", "{}", '{"page": -1, "limit": 10}', "", "400"),
+         "Huge limit (bug hunt)": ("GET", "/tasks", "{}", '{"limit": 999999}', "", "200"),
+         "Invalid email PUT (bug hunt)": ("PUT", "/tasks/1", "{}", "{}", '{"assignee_email": "not-an-email"}', "422"),
+         "DELETE non-existent (bug hunt)": ("DELETE", "/tasks/99999", "{}", "{}", "", "404"),
+         "Create user invalid email (bug)": ("POST", "/users", "{}", "{}", '{"username": "baduser", "email": "nope", "password": "x"}', "422"),
+         "SQL injection test": ("POST", "/tasks", "{}", "{}", '{"title": "test\'; DROP TABLE tasks;--"}', "201"),
+         "Long title crash (bug hunt)": ("POST", "/tasks", "{}", "{}", '{"title": "' + "A" * 6000 + '"}', "400"),
+     }
+     if action_name and action_name in quick_actions:
+         return quick_actions[action_name]
+     return [gr.update()] * 6
+
+
+ def run_baseline_agent(agent_type, state):
+     if not state or not state.initialized:
+         yield state, "Environment not initialized.", "", "", "", "", "", "", "", "", "", ""
+         return
+
+     from training.agents import RandomAgent, SequentialAgent, SmartAgent
+     agents = {"random": RandomAgent, "sequential": SequentialAgent, "smart": SmartAgent}
+     agent = agents[agent_type]()
+     t = TASKS[state.task_id]
+
+     obs = state.env.reset(task_id=state.task_id)
+     state.step_log = []
+     state.total_reward = 0.0
+     state.last_obs = obs
+
+     while not obs.done:
+         obs_dict = {
+             "status_code": obs.status_code, "response_body": obs.response_body,
+             "feedback": obs.feedback, "bugs_found_so_far": obs.bugs_found_so_far,
+             "coverage_summary": obs.coverage_summary, "known_resource_ids": obs.known_resource_ids,
+             "auth_tokens": obs.auth_tokens, "steps_taken": obs.steps_taken, "max_steps": obs.max_steps,
+         }
+         action = agent.act(obs_dict)
+         obs = state.env.step(action)
+         reward = obs.reward or 0.0
+         state.total_reward += reward
+         state.last_obs = obs
+
+         ms = action.method.value if hasattr(action.method, "value") else str(action.method)
+         state.step_log.append({
+             "step": obs.steps_taken, "method": ms, "endpoint": action.endpoint,
+             "status": obs.status_code, "reward": round(reward, 4), "bugs": obs.bugs_found_so_far,
+         })
+
+         resp_body = obs.response_body
+         if isinstance(resp_body, (dict, list)):
+             resp_str = json.dumps(resp_body, indent=2)
+         else:
+             resp_str = str(resp_body)
+
+         breakdown = obs.metadata.get("reward_breakdown", {})
+         reward_detail = format_reward_display(reward, state.total_reward, breakdown)
+
+         es = state.env.state
+         done_text = ""
+         if obs.done:
+             done_text = f"\n\n**EPISODE COMPLETE** — Final Score: {reward:.4f} | Bugs: {obs.bugs_found_so_far}/{t['total_bugs']}"
+
+         yield (
+             state,
+             f"[{agent_type}] {ms} {action.endpoint} -> {obs.status_code}{done_text}",
+             f"**{obs.status_code}**\n```json\n{resp_str[:500]}\n```",
+             reward_detail,
+             f"{obs.bugs_found_so_far} / {t['total_bugs']}",
+             format_coverage(obs.coverage_summary),
+             format_log(state.step_log),
+             f"{obs.steps_taken} / {obs.max_steps}" + (" (DONE)" if obs.done else ""),
+             format_bug_list(es.bugs_found_ids),
+             _generate_report(es.bugs_found_ids, state.step_log),
+             format_auth_tokens(obs.auth_tokens),
+             format_resources(obs.known_resource_ids),
+         )
+         time.sleep(0.3)
+
+
+ # =====================================================================
+ # Formatters
+ # =====================================================================
+
+ def format_reward_display(step_reward, cumulative, breakdown):
+     """Render reward metrics as styled HTML with explanations."""
+     components = [
+         ("Coverage", breakdown.get("coverage", 0),
+          "Reward for testing new endpoints and methods"),
+         ("Validity", breakdown.get("validity", 0),
+          "Reward for sending well-formed requests that return expected status codes"),
+         ("Bug", breakdown.get("bug_discovery", 0),
+          "Bonus for discovering a new bug in the API"),
+         ("Explore", breakdown.get("exploration", 0),
+          "Reward for trying new parameter combinations and edge cases"),
+         ("Penalty", breakdown.get("penalty", 0),
+          "Deduction for repeated or invalid requests"),
+     ]
+     bars = []
+     for label, value, tip in components:
+         val_color = "#16a34a" if value > 0 else "#dc2626" if value < 0 else "inherit"
+         bars.append(
+             f'<div style="display:flex;justify-content:space-between;align-items:center;'
+             f'padding:2px 0;font-size:0.82em;" title="{tip}">'
+             f'<span style="opacity:0.6;cursor:help;border-bottom:1px dotted currentColor;">'
+             f'{label}</span>'
+             f'<span style="color:{val_color};font-family:monospace;font-weight:600;">'
+             f'{value:+.3f}</span></div>'
+         )
+     cum_color = "#16a34a" if cumulative > 0 else "#dc2626" if cumulative < 0 else "inherit"
+     step_color = "#16a34a" if step_reward > 0 else "#dc2626" if step_reward < 0 else "inherit"
+     return (
+         f'<div style="display:flex;gap:16px;margin-bottom:8px;">'
+         f'<div style="flex:1;text-align:center;padding:6px;background:rgba(128,128,128,0.1);'
+         f'border-radius:8px;">'
+         f'<div style="font-size:0.72em;opacity:0.55;">STEP REWARD</div>'
+         f'<div style="font-size:1.3em;font-weight:700;color:{step_color};">'
+         f'{step_reward:+.4f}</div></div>'
+         f'<div style="flex:1;text-align:center;padding:6px;background:rgba(128,128,128,0.1);'
+         f'border-radius:8px;">'
+         f'<div style="font-size:0.72em;opacity:0.55;">CUMULATIVE</div>'
+         f'<div style="font-size:1.3em;font-weight:700;color:{cum_color};">'
+         f'{cumulative:.4f}</div></div></div>'
+         f'<div style="border:1px solid rgba(128,128,128,0.2);border-radius:8px;padding:6px 10px;">'
+         f'<div style="font-size:0.72em;opacity:0.5;margin-bottom:4px;">'
+         f'REWARD BREAKDOWN '
+         f'<span title="How the reward for the last step was calculated"'
+         f' style="cursor:help;">&#9432;</span></div>'
+         + "".join(bars)
+         + "</div>"
+     )
+
+
+ def format_coverage(summary):
+     if not summary:
+         return "No data"
+     pct = summary.get("coverage_pct", 0)
+     tested = summary.get("endpoints_tested", 0)
+     total = summary.get("total_endpoints", 0)
+     pairs = summary.get("method_endpoint_pairs", 0)
+     codes = summary.get("status_codes_seen", [])
+     color = "#dc2626" if pct < 30 else "#d97706" if pct < 70 else "#16a34a"
+     bar_html = (
+         f'<div style="display:flex;align-items:center;gap:8px;margin:4px 0;">'
+         f'<div style="flex:1;background:rgba(128,128,128,0.15);border-radius:6px;height:14px;overflow:hidden;">'
+         f'<div style="width:{pct:.1f}%;height:100%;background:{color};border-radius:6px;'
+         f'transition:width 0.3s ease;"></div></div>'
+         f'<span style="font-weight:700;min-width:48px;text-align:right;">{pct:.1f}%</span></div>'
+     )
+     code_pills = ""
+     for c in codes:
+         cc = "#16a34a" if 200 <= c < 300 else "#d97706" if 300 <= c < 400 else "#dc2626"
+         code_pills += (
+             f'<span style="background:{cc}18;color:{cc};padding:1px 7px;border-radius:10px;'
+             f'font-size:0.78em;font-weight:600;margin-right:4px;">{c}</span>'
+         )
+     return (
+         f"{bar_html}"
+         f'<div style="display:flex;gap:10px;margin:6px 0;font-size:0.82em;">'
+         f'<div style="flex:1;text-align:center;padding:4px;background:rgba(128,128,128,0.1);border-radius:6px;"'
+         f' title="How many unique API endpoints have been called">'
+         f'<div style="font-size:0.72em;opacity:0.5;">ENDPOINTS</div>'
+         f'<div style="font-weight:700;">{tested}/{total}</div></div>'
+         f'<div style="flex:1;text-align:center;padding:4px;background:rgba(128,128,128,0.1);border-radius:6px;"'
+         f' title="Unique combinations of HTTP method + endpoint path tested">'
+         f'<div style="font-size:0.72em;opacity:0.5;">METHOD+PATH</div>'
+         f'<div style="font-weight:700;">{pairs}</div></div></div>'
+         f'<div style="margin-top:4px;" title="HTTP status codes received from the API so far">'
+         f'<span style="font-size:0.72em;opacity:0.5;">STATUS CODES SEEN </span>'
+         f'{code_pills}</div>'
+     )
+
+
+ def format_log(log):
+     if not log:
+         return (
+             '<div style="opacity:0.55;font-size:0.85em;">'
+             "Each row shows an API request the agent made, the HTTP status it got back, "
+             "and the reward earned. Green = positive reward, red = penalty."
+             "</div>"
+         )
+     method_colors = {
+         "GET": "#2563eb", "POST": "#16a34a", "PUT": "#d97706",
+         "DELETE": "#dc2626", "PATCH": "#9333ea",
+     }
+     rows = []
+     for entry in log[-20:]:
+         m = entry["method"]
+         mcol = method_colors.get(m, "#6b7280")
+         r = entry["reward"]
+         rcol = "#16a34a" if r > 0 else "#dc2626" if r < 0 else "inherit"
+         bug_tag = (
+             '<span style="background:#92400e;color:#fef08a;padding:0 5px;border-radius:4px;'
+             'font-size:0.7em;margin-left:4px;">BUG FOUND</span>'
+         ) if r > 0.2 else ""
+         status = entry["status"]
+         scol = "#16a34a" if 200 <= status < 300 else "#d97706" if 300 <= status < 400 else "#dc2626"
+         rows.append(
+             f'<div style="display:flex;align-items:center;gap:6px;padding:3px 0;'
+             f'border-bottom:1px solid rgba(128,128,128,0.1);font-size:0.82em;">'
+             f'<span style="opacity:0.45;min-width:20px;text-align:right;">{entry["step"]}</span>'
+             f'<span style="background:{mcol}18;color:{mcol};padding:1px 6px;border-radius:4px;'
+             f'font-weight:600;font-size:0.8em;min-width:52px;text-align:center;">{m}</span>'
+             f'<span style="flex:1;overflow:hidden;text-overflow:ellipsis;'
+             f'white-space:nowrap;">{entry["endpoint"]}</span>'
+             f'<span style="color:{scol};font-weight:600;min-width:28px;text-align:right;">{status}</span>'
+             f'<span style="color:{rcol};min-width:52px;text-align:right;font-family:monospace;'
+             f'font-size:0.85em;">{r:+.3f}</span>{bug_tag}</div>'
+         )
+     omitted = ""
+     if len(log) > 20:
+         omitted = (
+             f'<div style="opacity:0.45;font-size:0.78em;padding:4px 0;text-align:center;">'
+             f'... {len(log) - 20} earlier steps not shown</div>'
+         )
+     header = (
+         '<div style="opacity:0.55;font-size:0.78em;margin-bottom:6px;">'
+         "API requests made by the agent. Each row: step number, HTTP method, "
+         "endpoint, status code, and reward earned.</div>"
+         '<div style="display:flex;gap:6px;padding:2px 0 6px;border-bottom:1px solid rgba(128,128,128,0.2);'
+         'font-size:0.75em;opacity:0.5;">'
+         '<span style="min-width:20px;text-align:right;">#</span>'
+         '<span style="min-width:52px;text-align:center;">Method</span>'
+         '<span style="flex:1;">Endpoint</span>'
+         '<span style="min-width:28px;text-align:right;">Status</span>'
+         '<span style="min-width:52px;text-align:right;">Reward</span></div>'
+     )
+     return header + omitted + "\n".join(rows)
+
+
+ def format_bug_list(bug_ids):
+     if not bug_ids:
+         return "No bugs found yet."
+     from server.bug_detector import BugDetector
+     detector = BugDetector("security_workflows")
+     severity_colors = {
+         "easy": "#16a34a",
+         "medium": "#d97706",
+         "hard": "#dc2626",
+     }
+     cards = []
+     for bid in sorted(bug_ids):
+         bug = detector.bugs.get(bid)
+         if bug:
+             fg = severity_colors.get(bug.severity, "#6b7280")
+             owasp_badge = f' | {bug.owasp.split(" ")[0]}' if bug.owasp else ""
+             cards.append(
+                 f'<div style="border:1px solid {fg}40;border-radius:8px;padding:8px 10px;'
+                 f'margin-bottom:6px;background:{fg}0d;">'
+                 f'<div style="display:flex;justify-content:space-between;align-items:center;">'
+                 f'<span style="font-weight:700;font-size:0.85em;">{bid}</span>'
+                 f'<span style="background:{fg};color:#fff;padding:1px 8px;border-radius:10px;'
+                 f'font-size:0.75em;font-weight:600;">{bug.severity.upper()}{owasp_badge}</span></div>'
+                 f'<div style="margin-top:4px;font-size:0.85em;opacity:0.7;">'
+                 f'{bug.description}</div>'
+                 f'<div style="margin-top:2px;font-size:0.78em;opacity:0.5;font-style:italic;">'
+                 f'{bug.owasp}</div></div>'
+             )
+     return "\n".join(cards)
+
+
+ def format_auth_tokens(tokens):
+     if not tokens:
+         return (
+             '<div style="opacity:0.5;font-size:0.85em;">'
+             "No tokens yet. Login via <code>POST /auth/login</code> to get auth tokens "
+             "for testing protected endpoints.</div>"
+         )
+     cards = []
+     for user, token in tokens.items():
+         cards.append(
+             f'<div style="display:flex;align-items:center;gap:8px;padding:4px 0;'
+             f'border-bottom:1px solid rgba(128,128,128,0.1);font-size:0.85em;">'
+             f'<span style="background:#2563eb18;color:#2563eb;padding:1px 8px;border-radius:10px;'
+             f'font-weight:600;font-size:0.8em;">{user}</span>'
+             f'<code style="opacity:0.55;font-size:0.82em;">{token[:20]}...</code></div>'
+         )
+     return (
+         '<div style="font-size:0.72em;opacity:0.5;margin-bottom:4px;"'
+         ' title="Auth tokens obtained by logging in. Use these in the Authorization header.">'
+         "AUTHENTICATED USERS</div>"
+         + "".join(cards)
+     )
+
+
+ def format_resources(ids):
+     if not ids:
+         return (
+             '<div style="opacity:0.5;font-size:0.85em;">'
+             "No resources created. Use POST endpoints to create tasks or users "
+             "and track their IDs here.</div>"
+         )
+     sections = []
+     type_colors = {"tasks": "#d97706", "users": "#2563eb"}
+     for rtype, id_list in ids.items():
+         color = type_colors.get(rtype, "#6b7280")
+         ids_str = ", ".join(str(i) for i in id_list) if isinstance(id_list, list) else str(id_list)
+         sections.append(
+             f'<div style="padding:4px 0;border-bottom:1px solid rgba(128,128,128,0.1);font-size:0.85em;">'
+             f'<span style="background:{color}18;color:{color};padding:1px 8px;border-radius:10px;'
+             f'font-weight:600;font-size:0.8em;text-transform:uppercase;">{rtype}</span>'
+             f'<span style="margin-left:8px;opacity:0.7;">IDs: {ids_str}</span></div>'
+         )
+     return (
+         '<div style="font-size:0.72em;opacity:0.5;margin-bottom:4px;"'
+         ' title="Resources created during this episode. Use these IDs in GET/PUT/DELETE requests.">'
+         "CREATED RESOURCES</div>"
+         + "".join(sections)
+     )
+
+
+ def format_endpoints():
+     lines = []
+     for ep in API_SPEC:
+         lines.append(f"**{ep['method']}** `{ep['path']}` — {ep.get('summary', '')}")
+     return "\n\n".join(lines)
+
+
+ # =====================================================================
+ # UI
+ # =====================================================================
+
+ def build_ui():
+     with gr.Blocks(title="API Testing Environment") as demo:
+         session = gr.State(value=new_session())
+
+         gr.Markdown(
+             "# API Testing Environment\n"
+             "An OpenEnv RL environment that trains AI agents to become automated **API security testers**. "
+             "A simulated API server with **13 hidden vulnerabilities** mapped to the **OWASP API Security Top 10** is provided. "
+             "Send HTTP requests, earn rewards for finding bugs and covering endpoints, and generate a **bug bounty report** at episode end. "
+             "Use **Manual Testing** to craft requests yourself, or run a **Baseline Agent** to watch an automated strategy."
+         )
+
+         with gr.Row():
+             # ── Left Panel ──
+             with gr.Column(scale=1):
+                 gr.Markdown("### Environment Control")
+                 task_dropdown = gr.Dropdown(choices=list(TASKS.keys()), value="basic_validation", label="Select Task")
+                 reset_btn = gr.Button("Reset Environment", variant="primary", size="lg")
+                 gr.Markdown(
+                     '<span style="font-size:0.8em;opacity:0.55;">'
+                     "Switch task or click Reset to start a fresh episode. "
+                     "Resets all scores, bugs, and step count.</span>"
+                 )
+                 status_box = gr.Markdown("Initializing...")
+
+                 gr.Markdown("---")
+                 gr.Markdown("### Scoreboard")
+                 gr.Markdown(
+                     '<span style="font-size:0.78em;opacity:0.55;">'
+                     "Tracks your testing progress. Steps are API calls you've made; "
+                     "bugs are issues discovered in the API; reward measures how well "
+                     "the agent is testing.</span>"
+                 )
+                 with gr.Row():
+                     step_display = gr.Markdown("0 / 25", label="Steps")
+                     bug_display = gr.Markdown("0 / 3", label="Bugs")
+                 reward_display = gr.Markdown(format_reward_display(0, 0, {}), label="Reward")
+                 coverage_display = gr.Markdown("No data", label="Coverage")
+
+                 gr.Markdown("---")
+                 gr.Markdown("### Session Context")
+                 gr.Markdown(
+                     '<span style="font-size:0.78em;opacity:0.55;">'
+                     "Tokens and resources gathered during this episode. "
+                     "Use tokens to test auth-protected endpoints and resource IDs for "
+                     "GET/PUT/DELETE requests.</span>"
+                 )
+                 auth_display = gr.Markdown(format_auth_tokens({}))
+                 resource_display = gr.Markdown(format_resources({}))
+
+                 gr.Markdown("---")
+                 with gr.Accordion("API Specification", open=False):
+                     gr.Markdown(format_endpoints())
+
+             # ── Center Panel ──
+             with gr.Column(scale=2):
+                 with gr.Tabs():
+                     with gr.Tab("Manual Testing"):
+                         gr.Markdown("### Craft Your Request")
+                         with gr.Row():
+                             method_input = gr.Dropdown(
+                                 choices=["GET", "POST", "PUT", "DELETE", "PATCH"],
+                                 value="GET", label="Method", scale=1,
+                             )
+                             endpoint_input = gr.Textbox(value="/tasks", label="Endpoint", placeholder="/tasks, /users/1, /auth/login", scale=3)
+                             expected_input = gr.Textbox(value="200", label="Expected Status", placeholder="200", scale=1)
+
+                         with gr.Row():
+                             headers_input = gr.Textbox(value="{}", label="Headers (JSON)", placeholder='{"Authorization": "Bearer ..."}', lines=1)
+                             params_input = gr.Textbox(value="{}", label="Query Params (JSON)", placeholder='{"page": 1, "limit": 10}', lines=1)
+
+                         body_input = gr.Textbox(value="", label="Request Body (JSON)", placeholder='{"title": "My Task", "description": "..."}', lines=3)
+
+                         send_btn = gr.Button("Send Request", variant="primary", size="lg")
+
+                         gr.Markdown("### Quick Actions")
+                         quick_actions = gr.Dropdown(
+                             choices=[
+                                 "GET /tasks", "GET /users", "GET /tasks/1",
+                                 "GET /tasks/999999 (bug hunt)", "POST create task",
+                                 "POST missing title (bug hunt)", "Login as alice", "Login as bob",
+                                 "Login empty pwd (bug hunt)", "Negative page (bug hunt)",
+                                 "Huge limit (bug hunt)", "Invalid email PUT (bug hunt)",
+                                 "DELETE non-existent (bug hunt)", "Create user invalid email (bug)",
+                                 "SQL injection test", "Long title crash (bug hunt)",
+                             ],
+                             label="Quick Actions", value=None,
+                         )
+                         quick_btn = gr.Button("Load Quick Action", variant="secondary")
+
+                     with gr.Tab("Run Baseline Agent"):
+                         gr.Markdown("### Automated Agents\nWatch a baseline agent test the API step by step.")
+                         agent_dropdown = gr.Dropdown(choices=["random", "sequential", "smart"], value="smart", label="Agent Type")
+                         run_agent_btn = gr.Button("Run Agent", variant="primary", size="lg")
+
+                 gr.Markdown("---")
+                 gr.Markdown("### Response")
+                 response_display = gr.Markdown("")
+
+                 gr.Markdown("### Feedback")
+                 feedback_display = gr.Markdown("")
+
+             # ── Right Panel ──
+             with gr.Column(scale=1):
+                 with gr.Tabs():
+                     with gr.Tab("Discovered Bugs"):
+                         bug_list_display = gr.Markdown("No bugs found yet.")
+
+                     with gr.Tab("Bug Report"):
+                         gr.Markdown("*Auto-generated OWASP security report. Populates as bugs are found.*")
+                         bug_report_display = gr.Markdown("No bugs found yet. Send requests to discover vulnerabilities.")
+
+                     with gr.Tab("Activity Log"):
+                         log_display = gr.Markdown("No steps yet.")
+
+         # ── Wiring ──
+         reset_outputs = [
+             session, status_box, feedback_display, response_display,
+             reward_display, bug_display, coverage_display, log_display,
+             step_display, bug_list_display, bug_report_display, auth_display, resource_display,
+         ]
+
+         step_outputs = [
+             session, feedback_display, response_display, reward_display,
+             bug_display, coverage_display, log_display, step_display,
+             bug_list_display, bug_report_display, auth_display, resource_display,
+         ]
+
+         reset_btn.click(fn=reset_env, inputs=[task_dropdown, session], outputs=reset_outputs)
+
+         send_btn.click(
+             fn=send_request,
+             inputs=[method_input, endpoint_input, headers_input, params_input, body_input, expected_input, session],
+             outputs=step_outputs,
+         )
+
+         quick_btn.click(
+             fn=apply_quick_action, inputs=[quick_actions, session],
+             outputs=[method_input, endpoint_input, headers_input, params_input, body_input, expected_input],
+         )
+
+         run_agent_btn.click(fn=run_baseline_agent, inputs=[agent_dropdown, session], outputs=step_outputs)
+
+         # Auto-reset on page load so users can start testing immediately
+         demo.load(fn=reset_env, inputs=[task_dropdown, session], outputs=reset_outputs)
+
+     return demo
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--port", type=int, default=int(os.getenv("GRADIO_SERVER_PORT", "7860")))
+     parser.add_argument("--host", default="0.0.0.0")
+     parser.add_argument("--share", action="store_true")
+     args = parser.parse_args()
+     build_ui().launch(server_name=args.host, server_port=args.port, share=args.share)
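`send_request` above parses the three JSON textboxes (headers, query params, body) with one convention: blank input falls back to a default, and anything else must parse as JSON or the request is rejected. A standalone sketch of that convention (`parse_json_field` is a hypothetical helper name, not part of the app):

```python
import json

def parse_json_field(raw: str, default):
    """Blank input -> default; otherwise the text must be valid JSON."""
    if not raw.strip():
        return default
    return json.loads(raw)  # raises json.JSONDecodeError on bad input

assert parse_json_field("", {}) == {}
assert parse_json_field("   ", None) is None
assert parse_json_field('{"page": 1, "limit": 10}', {}) == {"page": 1, "limit": 10}
```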
inference.py ADDED
@@ -0,0 +1,432 @@
+ #!/usr/bin/env python3
+ """
+ inference.py — OpenEnv API Testing Environment baseline inference script.
+
+ Runs an LLM agent against the API Testing Environment for all 3 tasks
+ (basic_validation -> edge_cases -> security_workflows) and emits the
+ mandatory [START]/[STEP]/[END] stdout format used by the OpenEnv judging
+ pipeline.
+
+ Required env vars (per OpenEnv submission spec):
+     API_BASE_URL   The OpenAI-compatible LLM endpoint
+     MODEL_NAME     The model identifier to use for inference
+     HF_TOKEN       Bearer token for the LLM endpoint (or API_KEY)
+
+ Optional env vars:
+     IMAGE_NAME             Docker image to spin up the env via from_docker_image()
+     LOCAL_IMAGE_NAME       Alias for IMAGE_NAME
+     ENV_BASE_URL           URL of an already-running env server (e.g. http://localhost:8000)
+     INFERENCE_TASKS        Comma-separated subset of tasks to run (default: all 3)
+     INFERENCE_MAX_STEPS    Override max steps per task
+     INFERENCE_TEMPERATURE  Default 0.4
+     INFERENCE_MAX_TOKENS   Default 4096 (plan completions need room for ~25 actions)
+
+ The script uses PLAN MODE: one LLM call per task produces a complete JSON
+ test plan, then the env executes each action sequentially. This matches the
+ GRPO training distribution and keeps total LLM cost to 3 calls per run, so
+ the script comfortably runs under 20 min on 2 vCPU / 8 GB RAM.
+
+ Usage:
+     # Local in-process (no Docker, fastest)
+     python inference.py
+
+     # Against a built docker image
+     IMAGE_NAME=api-testing-env:latest python inference.py
+
+     # Against an already running server
+     ENV_BASE_URL=http://localhost:8000 python inference.py
+
+     # Against a deployed HF Space
+     ENV_BASE_URL=https://your-user-api-testing-env.hf.space python inference.py
+ """
+
+ import json
+ import os
+ import sys
+ import time
+ import traceback
+ from typing import Any, Optional
+
+ # Make sibling modules importable when run from the repo root
+ _THIS_DIR = os.path.dirname(os.path.abspath(__file__))
+ if _THIS_DIR not in sys.path:
+     sys.path.insert(0, _THIS_DIR)
+
+ # Auto-load .env file if present (for local development).
+ # Judges set env vars directly, so this is harmless in production.
+ try:
+     from dotenv import load_dotenv
+     _env_path = os.path.join(_THIS_DIR, ".env")
+ _env_path = os.path.join(_THIS_DIR, ".env")
60
+ if os.path.exists(_env_path):
61
+ load_dotenv(_env_path)
62
+ except ImportError:
63
+ pass # python-dotenv is optional
64
+
65
+ from openai import OpenAI
66
+
67
+ from models import APITestAction, HTTPMethod # noqa: E402
68
+ from training.prompts import ( # noqa: E402
69
+ PLAN_SYSTEM_PROMPT,
70
+ format_plan_prompt,
71
+ parse_test_plan,
72
+ )
73
+
74
+
75
+ # ---------------------------------------------------------------------------
76
+ # Config (env vars per OpenEnv spec)
77
+ # ---------------------------------------------------------------------------
78
+
79
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
80
+ # Default model: must be available on the HuggingFace Inference Router.
81
+ # Llama-3.3-70B-Instruct is reliable, follows JSON instructions well, and free.
82
+ # Override via: MODEL_NAME=other/model python inference.py
83
+ MODEL_NAME = os.getenv("MODEL_NAME", "meta-llama/Llama-3.3-70B-Instruct")
84
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
85
+
86
+ if not API_KEY:
87
+ print(
88
+ "[ERROR] No HF_TOKEN or API_KEY found in environment.\n"
89
+ " Set one of:\n"
90
+ " export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx\n"
91
+ " Or create a .env file in this directory with:\n"
92
+ " HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx\n"
93
+ " Get a token from: https://huggingface.co/settings/tokens\n"
94
+ " Make sure it has 'Make calls to Inference Providers' permission.",
95
+ file=sys.stderr,
96
+ )
97
+ sys.exit(1)
98
+
99
+ IMAGE_NAME = os.getenv("IMAGE_NAME") or os.getenv("LOCAL_IMAGE_NAME")
100
+ ENV_BASE_URL = os.getenv("ENV_BASE_URL")
101
+
102
+ BENCHMARK = "api_testing_env"
103
+ DEFAULT_TASKS = ["basic_validation", "edge_cases", "security_workflows"]
104
+ TASKS = [t.strip() for t in os.getenv("INFERENCE_TASKS", ",".join(DEFAULT_TASKS)).split(",") if t.strip()]
105
+
106
+ TEMPERATURE = float(os.getenv("INFERENCE_TEMPERATURE", "0.4"))
107
+ MAX_TOKENS = int(os.getenv("INFERENCE_MAX_TOKENS", "4096"))
108
+ _MAX_STEPS_OVERRIDE = os.getenv("INFERENCE_MAX_STEPS")
109
+ MAX_STEPS_OVERRIDE: Optional[int] = int(_MAX_STEPS_OVERRIDE) if _MAX_STEPS_OVERRIDE else None
110
+
111
+
112
+ # ---------------------------------------------------------------------------
113
+ # Strict stdout logging β€” these line formats are checked by the judge
114
+ # ---------------------------------------------------------------------------
115
+
116
+ def log_start(task: str, env: str, model: str) -> None:
117
+ print(f"[START] task={task} env={env} model={model}", flush=True)
118
+
119
+
120
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
121
+ print(
122
+ f"[STEP] step={step} action={action} reward={reward:.2f} "
123
+ f"done={str(done).lower()} error={error if error else 'null'}",
124
+ flush=True,
125
+ )
126
+
127
+
128
+ def log_end(success: bool, steps: int, score: float, rewards: list[float]) -> None:
129
+ rewards_str = ",".join(f"{r:.2f}" for r in rewards)
130
+ print(
131
+ f"[END] success={str(success).lower()} steps={steps} "
132
+ f"score={score:.3f} rewards={rewards_str}",
133
+ flush=True,
134
+ )
135
+
136
+
137
+ def _action_str(action: APITestAction) -> str:
138
+ """Compact human-readable action label for the [STEP] line."""
139
+ method = action.method.value if hasattr(action.method, "value") else str(action.method)
140
+ return f"{method}_{action.endpoint}"
141
+
142
+
143
+ # ---------------------------------------------------------------------------
144
+ # LLM call β€” plan mode (one completion per task)
145
+ # ---------------------------------------------------------------------------
146
+
147
+ def get_plan_from_llm(client: OpenAI, observation) -> str:
148
+ """Ask the LLM for a complete JSON test plan for this task.
149
+
150
+ Wraps the array in {"actions": [...]} so we can use OpenAI structured
151
+ output mode (`response_format={"type": "json_object"}`), which forces
152
+ the LLM to produce valid JSON. This is much more reliable than asking
153
+ for a raw JSON array.
154
+ """
155
+ user_prompt = format_plan_prompt(observation)
156
+
157
+ # Stronger system prompt for structured output mode
158
+ system_prompt = (
159
+ PLAN_SYSTEM_PROMPT
160
+ + "\n\nIMPORTANT: Output a JSON object with a single key 'actions' "
161
+ + "containing the array of actions:\n"
162
+ + '{"actions": [{"method": "GET", "endpoint": "/tasks", "headers": {}, '
163
+ + '"query_params": {}, "body": null, "expected_status": 200}, ...]}'
164
+ )
165
+
166
+ try:
167
+ completion = client.chat.completions.create(
168
+ model=MODEL_NAME,
169
+ messages=[
170
+ {"role": "system", "content": system_prompt},
171
+ {"role": "user", "content": user_prompt},
172
+ ],
173
+ temperature=TEMPERATURE,
174
+ max_tokens=MAX_TOKENS,
175
+ response_format={"type": "json_object"}, # forces valid JSON
176
+ stream=False,
177
+ )
178
+ text = (completion.choices[0].message.content or "").strip()
179
+ print(f"[DEBUG] LLM response length: {len(text)} chars", flush=True)
180
+ if len(text) > 0:
181
+ preview = text[:300].replace("\n", " ")
182
+ print(f"[DEBUG] LLM response preview: {preview}...", flush=True)
183
+ else:
184
+ print(f"[DEBUG] LLM returned EMPTY string", flush=True)
185
+ if hasattr(completion, "choices") and completion.choices:
186
+ finish_reason = getattr(completion.choices[0], "finish_reason", None)
187
+ print(f"[DEBUG] finish_reason: {finish_reason}", flush=True)
188
+ return text
189
+ except Exception as exc: # noqa: BLE001
190
+ print(f"[DEBUG] structured-output call failed ({type(exc).__name__}: {exc}), retrying without response_format...", flush=True)
191
+ # Some providers don't support response_format β€” fall back to plain text
192
+ try:
193
+ completion = client.chat.completions.create(
194
+ model=MODEL_NAME,
195
+ messages=[
196
+ {"role": "system", "content": PLAN_SYSTEM_PROMPT},
197
+ {"role": "user", "content": user_prompt},
198
+ ],
199
+ temperature=TEMPERATURE,
200
+ max_tokens=MAX_TOKENS,
201
+ stream=False,
202
+ )
203
+ text = (completion.choices[0].message.content or "").strip()
204
+ print(f"[DEBUG] fallback LLM response length: {len(text)} chars", flush=True)
205
+ return text
206
+ except Exception as exc2: # noqa: BLE001
207
+ print(f"[DEBUG] fallback LLM call failed: {type(exc2).__name__}: {exc2}", flush=True)
208
+ return ""
209
+
210
+
211
+ # ---------------------------------------------------------------------------
212
+ # Per-task scoring helper β€” keeps the score in [0, 1]
213
+ # ---------------------------------------------------------------------------
214
+
215
+ def compute_task_score(state, total_step_reward: float) -> float:
216
+ """Combine grader signals into a single normalized score in [0, 1].
217
+
218
+ The server already runs `TaskGrader.grade(...)` at episode end and adds
219
+ that score (already in [0, 1]) on top of the last step reward. We do
220
+ NOT trust the raw step rewards β€” those are sums of partial signals and
221
+ can exceed 1.0. Instead we derive the score from the published state:
222
+ score = 0.7 * (bugs_found / total_bugs) + 0.3 * (coverage_pct / 100)
223
+ which is bounded in [0, 1] and rewards both finding bugs and coverage.
224
+ """
225
+ bugs_found = getattr(state, "bugs_found", 0) or 0
226
+ total_bugs = getattr(state, "total_bugs", 0) or 0
227
+ coverage_pct = getattr(state, "coverage_pct", 0.0) or 0.0
228
+
229
+ bug_ratio = (bugs_found / total_bugs) if total_bugs > 0 else 0.0
230
+ coverage_ratio = max(0.0, min(1.0, coverage_pct / 100.0))
231
+
232
+ score = 0.70 * bug_ratio + 0.30 * coverage_ratio
233
+ return max(0.0, min(1.0, score))
234
+
235
+
236
+ # ---------------------------------------------------------------------------
237
+ # Environment connector β€” supports docker / remote / in-process
238
+ # ---------------------------------------------------------------------------
239
+
240
+ class _EnvHandle:
241
+ """Thin wrapper that exposes a uniform reset/step/state/close API.
242
+
243
+ Three modes, picked automatically:
244
+ 1. IMAGE_NAME set -> APITestEnv.from_docker_image(IMAGE_NAME)
245
+ 2. ENV_BASE_URL set -> APITestEnv(base_url=ENV_BASE_URL)
246
+ 3. neither set (default) -> APITestEnvironment() in-process
247
+ """
248
+
249
+ def __init__(self):
250
+ self._mode: str = ""
251
+ self._client = None # remote/docker client
252
+ self._env = None # in-process env
253
+
254
+ def open(self):
255
+ if IMAGE_NAME:
256
+ from client import APITestEnv
257
+ self._mode = "docker"
258
+ self._client = APITestEnv.from_docker_image(IMAGE_NAME)
259
+ elif ENV_BASE_URL:
260
+ from client import APITestEnv
261
+ self._mode = "remote"
262
+ self._client = APITestEnv(base_url=ENV_BASE_URL)
263
+ if hasattr(self._client, "connect"):
264
+ self._client.connect()
265
+ else:
266
+ from server.environment import APITestEnvironment
267
+ self._mode = "local"
268
+ self._env = APITestEnvironment()
269
+ return self
270
+
271
+ @property
272
+ def mode(self) -> str:
273
+ return self._mode
274
+
275
+ def reset(self, task_id: str, seed: int = 42):
276
+ if self._mode in ("docker", "remote"):
277
+ result = self._client.reset(task_id=task_id, seed=seed)
278
+ return result.observation, result
279
+ obs = self._env.reset(seed=seed, task_id=task_id)
280
+ return obs, None
281
+
282
+ def step(self, action: APITestAction):
283
+ if self._mode in ("docker", "remote"):
284
+ result = self._client.step(action)
285
+ return result.observation, result.reward or 0.0, result.done
286
+ obs = self._env.step(action)
287
+ return obs, (obs.reward or 0.0), obs.done
288
+
289
+ def state(self):
290
+ if self._mode in ("docker", "remote"):
291
+ return self._client.state()
292
+ return self._env.state
293
+
294
+ def close(self):
295
+ try:
296
+ if self._client is not None and hasattr(self._client, "close"):
297
+ self._client.close()
298
+ except Exception as exc: # noqa: BLE001
299
+ print(f"[DEBUG] env close error: {exc}", flush=True)
300
+
301
+
302
+ # ---------------------------------------------------------------------------
303
+ # One full episode (one task) -> emits [START] / [STEP]* / [END]
304
+ # ---------------------------------------------------------------------------
305
+
306
+ def run_task(env: _EnvHandle, client: OpenAI, task_id: str, seed: int = 42) -> dict:
307
+ rewards: list[float] = []
308
+ steps_taken = 0
309
+ last_error: Optional[str] = None
310
+ score = 0.0
311
+
312
+ log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
313
+
314
+ try:
315
+ obs, _ = env.reset(task_id=task_id, seed=seed)
316
+ max_steps = MAX_STEPS_OVERRIDE or getattr(obs, "max_steps", 25)
317
+
318
+ # 1) Ask the LLM for a full plan
319
+ plan_text = get_plan_from_llm(client, obs)
320
+ actions = parse_test_plan(plan_text) if plan_text else []
321
+
322
+ # Fallback: if parser failed but we have text, try a more lenient parse
323
+ if not actions and plan_text:
324
+ print(f"[DEBUG] {task_id}: parse_test_plan returned 0, trying lenient parse...", flush=True)
325
+ try:
326
+ import json as _json, re as _re
327
+ # Try to find any JSON array of objects in the text
328
+ cleaned = plan_text
329
+ if "</think>" in cleaned:
330
+ cleaned = cleaned.split("</think>", 1)[-1]
331
+ # Find first [ and last ]
332
+ start = cleaned.find("[")
333
+ end = cleaned.rfind("]")
334
+ if start >= 0 and end > start:
335
+ arr_str = cleaned[start:end+1]
336
+ raw = _json.loads(arr_str)
337
+ if isinstance(raw, list):
338
+ from training.prompts import _dict_to_action
339
+ for item in raw:
340
+ if isinstance(item, dict) and "method" in item:
341
+ a = _dict_to_action(item)
342
+ if a:
343
+ actions.append(a)
344
+ print(f"[DEBUG] {task_id}: lenient parse recovered {len(actions)} actions", flush=True)
345
+ except Exception as exc:
346
+ print(f"[DEBUG] {task_id}: lenient parse failed: {exc}", flush=True)
347
+ if not actions:
348
+ last_error = "no_plan_parsed"
349
+ print(f"[DEBUG] {task_id}: model produced 0 valid actions", flush=True)
350
+
351
+ actions = actions[:max_steps]
352
+
353
+ # 2) Execute each action and emit one [STEP] line per env.step()
354
+ done = False
355
+ for i, action in enumerate(actions, start=1):
356
+ if done:
357
+ break
358
+ try:
359
+ obs, reward, done = env.step(action)
360
+ rewards.append(float(reward))
361
+ steps_taken = i
362
+ log_step(step=i, action=_action_str(action), reward=reward, done=done, error=None)
363
+ except Exception as exc: # noqa: BLE001
364
+ last_error = f"{type(exc).__name__}: {exc}"
365
+ rewards.append(0.0)
366
+ steps_taken = i
367
+ log_step(step=i, action=_action_str(action), reward=0.0, done=False, error=last_error)
368
+
369
+ # 3) Score from final state
370
+ try:
371
+ final_state = env.state()
372
+ score = compute_task_score(final_state, sum(rewards))
373
+ except Exception as exc: # noqa: BLE001
374
+ last_error = last_error or f"state_error: {exc}"
375
+ score = 0.0
376
+
377
+ except Exception as exc: # noqa: BLE001
378
+ last_error = f"{type(exc).__name__}: {exc}"
379
+ traceback.print_exc()
380
+
381
+ success = score >= 0.20 # any meaningful progress counts as a successful episode
382
+ log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
383
+
384
+ return {
385
+ "task_id": task_id,
386
+ "success": success,
387
+ "steps": steps_taken,
388
+ "score": score,
389
+ "rewards": rewards,
390
+ "error": last_error,
391
+ }
392
+
393
+
394
+ # ---------------------------------------------------------------------------
395
+ # Main β€” runs all 3 tasks sequentially against ONE env handle
396
+ # ---------------------------------------------------------------------------
397
+
398
+ def main() -> None:
399
+ client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
400
+
401
+ print(
402
+ f"[DEBUG] inference.py starting | model={MODEL_NAME} | "
403
+ f"base_url={API_BASE_URL} | tasks={TASKS}",
404
+ flush=True,
405
+ )
406
+
407
+ env = _EnvHandle().open()
408
+ print(f"[DEBUG] env mode={env.mode}", flush=True)
409
+
410
+ summary: list[dict] = []
411
+ t0 = time.time()
412
+ try:
413
+ for task_id in TASKS:
414
+ result = run_task(env, client, task_id=task_id, seed=42)
415
+ summary.append(result)
416
+ finally:
417
+ env.close()
418
+
419
+ elapsed = time.time() - t0
420
+ avg_score = sum(r["score"] for r in summary) / max(len(summary), 1)
421
+ print(
422
+ f"[DEBUG] inference.py finished in {elapsed:.1f}s | "
423
+ f"avg_score={avg_score:.3f}",
424
+ flush=True,
425
+ )
426
+ print("[DEBUG] per-task scores: " + json.dumps(
427
+ {r["task_id"]: round(r["score"], 3) for r in summary}
428
+ ), flush=True)
429
+
430
+
431
+ if __name__ == "__main__":
432
+ main()
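The scoring rule in `compute_task_score` can be exercised on its own, without any env types. A minimal sketch of the same formula (the `SimpleNamespace` stand-in for the state object is an assumption for illustration):

```python
from types import SimpleNamespace

def score(state) -> float:
    # 70% weight on bug discovery, 30% on endpoint coverage, clamped to [0, 1]
    bugs_found = getattr(state, "bugs_found", 0) or 0
    total_bugs = getattr(state, "total_bugs", 0) or 0
    coverage_pct = getattr(state, "coverage_pct", 0.0) or 0.0
    bug_ratio = (bugs_found / total_bugs) if total_bugs > 0 else 0.0
    coverage_ratio = max(0.0, min(1.0, coverage_pct / 100.0))
    return max(0.0, min(1.0, 0.70 * bug_ratio + 0.30 * coverage_ratio))

# 3/5 bugs found with 80% coverage: 0.7*0.6 + 0.3*0.8, roughly 0.66
demo = SimpleNamespace(bugs_found=3, total_bugs=5, coverage_pct=80.0)
print(f"{score(demo):.2f}")
```

Because `total_bugs == 0` yields a zero bug ratio rather than a division error, a task with no planted bugs is scored purely on coverage.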
models.py ADDED
@@ -0,0 +1,110 @@
+ """
+ Data models for the API Testing Environment.
+ 
+ Defines Action, Observation, State for API integration testing training.
+ An AI agent learns to test REST APIs intelligently — discovering endpoints,
+ crafting requests, validating responses, finding bugs, and handling edge cases.
+ """
+ 
+ from enum import Enum
+ from typing import Any, Optional
+ 
+ from pydantic import Field
+ 
+ from openenv.core.env_server.types import Action, Observation, State
+ 
+ 
+ class HTTPMethod(str, Enum):
+     GET = "GET"
+     POST = "POST"
+     PUT = "PUT"
+     DELETE = "DELETE"
+     PATCH = "PATCH"
+ 
+ 
+ class BugSeverity(str, Enum):
+     EASY = "easy"
+     MEDIUM = "medium"
+     HARD = "hard"
+ 
+ 
+ class APITestAction(Action):
+     """What the agent sends each step — an HTTP request to test the API."""
+ 
+     method: HTTPMethod = Field(..., description="HTTP method")
+     endpoint: str = Field(..., min_length=1, description="API endpoint path, e.g. /tasks, /users/1")
+     headers: dict[str, str] = Field(default_factory=dict, description="Request headers")
+     query_params: dict[str, Any] = Field(default_factory=dict, description="URL query parameters")
+     body: Optional[dict[str, Any]] = Field(default=None, description="Request JSON body")
+     expected_status: Optional[int] = Field(
+         default=None,
+         description="What the agent expects the status code to be (used for bug detection)",
+     )
+ 
+ 
+ class EndpointInfo(Action):
+     """Information about a single API endpoint from the spec."""
+ 
+     method: str = ""
+     path: str = ""
+     summary: str = ""
+     parameters: list[dict[str, Any]] = Field(default_factory=list)
+     request_body_schema: Optional[dict[str, Any]] = None
+     response_schema: Optional[dict[str, Any]] = None
+ 
+ 
+ class APITestObservation(Observation):
+     """What the agent sees after each step."""
+ 
+     # API spec info (provided on reset, updated each step)
+     available_endpoints: list[dict[str, Any]] = Field(
+         default_factory=list, description="Available API endpoints from the spec"
+     )
+ 
+     # Response from last request
+     status_code: int = Field(default=0, description="HTTP status code of the response")
+     response_body: Any = Field(default=None, description="Response body (JSON or text)")
+     response_headers: dict[str, str] = Field(default_factory=dict, description="Response headers")
+     response_time_ms: float = Field(default=0.0, description="Response time in milliseconds")
+ 
+     # Feedback
+     feedback: str = Field(default="", description="Human-readable feedback about the last action")
+     bugs_found_so_far: int = Field(default=0, description="Number of bugs found so far")
+     coverage_summary: dict[str, Any] = Field(
+         default_factory=dict,
+         description="Coverage stats: endpoints_tested, methods_used, status_codes_seen",
+     )
+ 
+     # Context from prior steps
+     known_resource_ids: dict[str, list[Any]] = Field(
+         default_factory=dict,
+         description="Resource IDs created by POST requests, keyed by resource type",
+     )
+     auth_tokens: dict[str, str] = Field(
+         default_factory=dict,
+         description="Available auth tokens for different users/roles",
+     )
+ 
+     # Task info
+     task_id: str = Field(default="", description="Current task identifier")
+     task_description: str = Field(default="", description="Description of the current task")
+     steps_taken: int = Field(default=0, description="Steps taken in this episode")
+     max_steps: int = Field(default=30, description="Maximum steps per episode")
+ 
+ 
+ class APITestState(State):
+     """Episode metadata — internal state exposed via state() endpoint."""
+ 
+     task_id: str = ""
+     task_description: str = ""
+     difficulty: str = "easy"
+     steps_taken: int = 0
+     max_steps: int = 30
+     bugs_found: int = 0
+     total_bugs: int = 0
+     bugs_found_ids: list[str] = Field(default_factory=list)
+     coverage_pct: float = 0.0
+     endpoints_tested: int = 0
+     total_endpoints: int = 0
+     current_score: float = 0.0
+     cumulative_reward: float = 0.0
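The action schema above boils down to a method enum plus endpoint, headers, params, body, and expected-status fields. A dependency-free sketch of the same shape (plain stdlib stand-ins, not the actual pydantic/openenv models):

```python
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import Any, Optional

class HTTPMethod(str, Enum):
    GET = "GET"
    POST = "POST"
    PUT = "PUT"
    DELETE = "DELETE"
    PATCH = "PATCH"

@dataclass
class APITestActionSketch:
    # Mirrors APITestAction's fields; defaults match the Field(default_factory=...) pattern
    method: HTTPMethod
    endpoint: str
    headers: dict[str, str] = field(default_factory=dict)
    query_params: dict[str, Any] = field(default_factory=dict)
    body: Optional[dict[str, Any]] = None
    expected_status: Optional[int] = None

a = APITestActionSketch(method=HTTPMethod.POST, endpoint="/tasks",
                        body={"title": "x"}, expected_status=201)
payload = asdict(a)
# str-valued enum members compare equal to their string value
assert payload["method"] == "POST"
assert payload["expected_status"] == 201
```

Because `HTTPMethod` subclasses `str`, serialized actions carry plain strings like `"POST"`, which is what the JSON test plans in `inference.py` produce.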
openenv.yaml ADDED
@@ -0,0 +1,6 @@
+ spec_version: 1
+ name: api_testing_env
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 8000
openenv_api_testing.egg-info/PKG-INFO ADDED
@@ -0,0 +1,19 @@
+ Metadata-Version: 2.4
+ Name: openenv-api-testing
+ Version: 0.1.0
+ Summary: RL environment for intelligent API integration testing — train agents to find bugs in REST APIs
+ Requires-Python: >=3.10
+ Requires-Dist: openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git@v0.2.1
+ Requires-Dist: fastapi>=0.104.0
+ Requires-Dist: uvicorn>=0.24.0
+ Requires-Dist: httpx>=0.25.0
+ Requires-Dist: pydantic>=2.0.0
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
+ Provides-Extra: train
+ Requires-Dist: trl[vllm]>=0.29.0; extra == "train"
+ Requires-Dist: torch>=2.8.0; extra == "train"
+ Requires-Dist: peft; extra == "train"
+ Requires-Dist: transformers; extra == "train"
+ Requires-Dist: datasets; extra == "train"
openenv_api_testing.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,26 @@
+ README.md
+ pyproject.toml
+ ./__init__.py
+ ./baseline.py
+ ./client.py
+ ./models.py
+ openenv_api_testing.egg-info/PKG-INFO
+ openenv_api_testing.egg-info/SOURCES.txt
+ openenv_api_testing.egg-info/dependency_links.txt
+ openenv_api_testing.egg-info/entry_points.txt
+ openenv_api_testing.egg-info/requires.txt
+ openenv_api_testing.egg-info/top_level.txt
+ server/__init__.py
+ server/app.py
+ server/bug_detector.py
+ server/environment.py
+ server/graders.py
+ server/reward.py
+ server/buggy_api/__init__.py
+ server/buggy_api/database.py
+ server/buggy_api/main.py
+ server/buggy_api/models.py
+ server/buggy_api/routes/__init__.py
+ server/buggy_api/routes/auth.py
+ server/buggy_api/routes/tasks.py
+ server/buggy_api/routes/users.py
openenv_api_testing.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+ 
openenv_api_testing.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
+ [console_scripts]
+ server = api_testing_env.server.app:main
openenv_api_testing.egg-info/requires.txt ADDED
@@ -0,0 +1,16 @@
+ openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git@v0.2.1
+ fastapi>=0.104.0
+ uvicorn>=0.24.0
+ httpx>=0.25.0
+ pydantic>=2.0.0
+ 
+ [dev]
+ pytest>=8.0.0
+ pytest-cov>=4.0.0
+ 
+ [train]
+ trl[vllm]>=0.29.0
+ torch>=2.8.0
+ peft
+ transformers
+ datasets
openenv_api_testing.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ api_testing_env
pyproject.toml ADDED
@@ -0,0 +1,60 @@
+ [build-system]
+ requires = ["setuptools>=45", "wheel"]
+ build-backend = "setuptools.build_meta"
+ 
+ [project]
+ name = "openenv-api-testing"
+ version = "0.1.0"
+ description = "RL environment for intelligent API integration testing — train agents to find bugs in REST APIs"
+ requires-python = ">=3.10"
+ dependencies = [
+     "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git@v0.2.1",
+     "fastapi>=0.104.0",
+     "uvicorn>=0.24.0",
+     "httpx>=0.25.0",
+     "pydantic>=2.0.0",
+     "openai>=1.40.0",
+     "gradio>=5.0.0",
+ ]
+ 
+ [project.optional-dependencies]
+ ui = [
+     "gradio>=5.0.0",
+ ]
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-cov>=4.0.0",
+ ]
+ train = [
+     "trl>=0.15.0",
+     "torch>=2.1.0",
+     "peft>=0.7.0",
+     "transformers>=4.40.0",
+     "datasets>=2.16.0",
+     "wandb>=0.16.0",
+     "huggingface-hub>=0.20.0",
+     "matplotlib>=3.8.0",
+ ]
+ 
+ [project.scripts]
+ server = "api_testing_env.server.app:main"
+ 
+ [tool.uv]
+ package = false
+ 
+ [tool.setuptools]
+ include-package-data = true
+ packages = [
+     "api_testing_env",
+     "api_testing_env.server",
+     "api_testing_env.server.buggy_api",
+     "api_testing_env.server.buggy_api.routes",
+     "api_testing_env.training",
+ ]
+ 
+ [tool.setuptools.package-dir]
+ api_testing_env = "."
+ "api_testing_env.server" = "server"
+ "api_testing_env.server.buggy_api" = "server/buggy_api"
+ "api_testing_env.server.buggy_api.routes" = "server/buggy_api/routes"
+ "api_testing_env.training" = "training"
requirements.txt ADDED
@@ -0,0 +1,27 @@
+ # Core dependencies
+ openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git@v0.2.1
+ fastapi>=0.104.0
+ uvicorn>=0.24.0
+ httpx>=0.25.0
+ pydantic>=2.0.0,<2.12
+ 
+ # Training dependencies
+ # NOTE: PyTorch is NOT listed here — it must be installed separately
+ # with the correct CUDA version. See setup.sh or run:
+ #   pip install torch --index-url https://download.pytorch.org/whl/cu121
+ trl>=0.15.0
+ peft>=0.7.0
+ transformers>=4.40.0
+ datasets>=2.16.0
+ 
+ # Weights & Biases (optional but recommended)
+ wandb>=0.16.0
+ 
+ # HuggingFace Hub (for model push)
+ huggingface-hub>=0.20.0
+ 
+ # Plots and metrics
+ matplotlib>=3.8.0
+ 
+ # UI
+ gradio>=5.0.0
server/__init__.py ADDED
File without changes
server/app.py ADDED
@@ -0,0 +1,135 @@
+ """
+ FastAPI application for the API Testing Environment.
+ 
+ Endpoints:
+ - POST /reset: Reset the environment
+ - POST /step: Execute an action
+ - GET /state: Get current environment state
+ - GET /schema: Get action/observation schemas
+ - WS /ws: WebSocket endpoint for persistent sessions
+ - GET /: Info page
+ 
+ Usage:
+     uvicorn server.app:app --host 0.0.0.0 --port 8000
+ """
+ 
+ import os
+ import logging
+ 
+ try:
+     from openenv.core.env_server.http_server import create_app
+     from ..models import APITestAction, APITestObservation
+     from .environment import APITestEnvironment
+ except ImportError:
+     from openenv.core.env_server.http_server import create_app
+     from models import APITestAction, APITestObservation
+     from server.environment import APITestEnvironment
+ 
+ from fastapi.responses import RedirectResponse
+ 
+ logger = logging.getLogger(__name__)
+ 
+ app = create_app(
+     APITestEnvironment,
+     APITestAction,
+     APITestObservation,
+     env_name="api_testing_env",
+     max_concurrent_envs=int(os.environ.get("MAX_ENVS", "1")),
+ )
+ 
+ # Track whether the Gradio UI is available so root can redirect to it
+ _GRADIO_MOUNTED = False
+ 
+ 
+ @app.get("/info")
+ async def info():
+     """JSON info about the environment (replaces the old `/` JSON endpoint)."""
+     return {
+         "name": "API Testing Environment",
+         "description": "An OpenEnv RL environment where an AI agent learns to test REST APIs intelligently",
+         "tasks": ["basic_validation", "edge_cases", "security_workflows"],
+         "ui": "/ui",
+         "docs": "/docs",
+         "schema": "/schema",
+     }
+ 
+ 
+ @app.get("/tasks")
+ async def list_tasks():
+     """List available tasks with descriptions."""
+     from .environment import TASKS
+     return {
+         task_id: {
+             "description": task["description"],
+             "difficulty": task["difficulty"],
+             "max_steps": task["max_steps"],
+             "total_bugs": task["total_bugs"],
+         }
+         for task_id, task in TASKS.items()
+     }
+ 
+ 
+ # ---------------------------------------------------------------------------
+ # Mount Gradio UI at /ui (only if gradio is installed and ENABLE_WEB_INTERFACE)
+ # ---------------------------------------------------------------------------
+ if os.environ.get("ENABLE_WEB_INTERFACE", "true").lower() in ("1", "true", "yes"):
+     try:
+         import gradio as gr  # type: ignore
+         # Make the repo root importable so gradio_app's `from models import ...` works
+         import sys
+         _REPO_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
+         if _REPO_ROOT not in sys.path:
+             sys.path.insert(0, _REPO_ROOT)
+         from gradio_app import build_ui  # type: ignore
+ 
+         _gradio_ui = build_ui()
+         app = gr.mount_gradio_app(app, _gradio_ui, path="/ui")
+         _GRADIO_MOUNTED = True
+         logger.info("Gradio UI mounted at /ui")
+     except Exception as exc:  # noqa: BLE001
+         logger.warning(f"Skipping Gradio mount ({type(exc).__name__}: {exc})")
+ 
+ 
+ # ---------------------------------------------------------------------------
+ # Root redirect: send visitors to the Gradio UI if mounted, else to JSON info
+ # ---------------------------------------------------------------------------
+ @app.get("/", include_in_schema=False)
+ async def root_redirect():
+     """Redirect / to the Gradio UI when available, otherwise to /info JSON."""
+     if _GRADIO_MOUNTED:
+         return RedirectResponse(url="/ui", status_code=307)
+     return RedirectResponse(url="/info", status_code=307)
+ 
+ 
+ def main(host: str = None, port: int = None):
+     """Entry point for `uv run server` and `python -m server.app`.
+ 
+     When invoked from the CLI without args, parses argv for --host / --port.
+     """
+     import uvicorn
+ 
+     if host is None or port is None:
+         import argparse
+         parser = argparse.ArgumentParser(description="API Testing Environment server")
+         parser.add_argument("--host", default="0.0.0.0")
+         parser.add_argument("--port", type=int, default=None)
+         args, _ = parser.parse_known_args()
+         host = host or args.host
+         port = port or args.port
+ 
+     if port is None:
+         port = int(os.environ.get("PORT", "8000"))
+ 
+     logging.basicConfig(
+         level=logging.INFO,
+         format="%(asctime)s %(levelname)s [%(name)s] %(message)s",
+     )
+     logging.getLogger("httpx").setLevel(logging.WARNING)
+     logging.getLogger("httpcore").setLevel(logging.WARNING)
+     logging.getLogger("uvicorn.access").setLevel(logging.WARNING)
+ 
+     uvicorn.run(app, host=host, port=port)
+ 
+ 
+ if __name__ == "__main__":
+     main()
server/bug_detector.py ADDED
@@ -0,0 +1,430 @@
1
+ """
2
+ Bug detection logic β€” checks if the agent's action/response pair reveals a planted bug.
3
+
4
+ Each bug has:
5
+ - A unique ID
6
+ - A severity level (easy/medium/hard)
7
+ - A detection function that checks action + response
8
+ """
9
+
10
+ from dataclasses import dataclass
11
+ from typing import Any, Callable, Optional
12
+ import re
13
+
14
+
15
+ @dataclass
16
+ class Bug:
17
+ id: str
18
+ severity: str # "easy", "medium", "hard"
19
+ description: str
20
+ category: str # "status_code", "validation", "security", "data_integrity"
21
+ owasp: str = "" # OWASP API Security Top 10 (2023) category
22
+ recommendation: str = "" # Fix recommendation for bug bounty reports
23
+
24
+
25
+ @dataclass
26
+ class BugDetection:
27
+ bug: Bug
28
+ evidence: str # Human-readable explanation of how the bug was detected
29
+
30
+
31
+ class BugDetector:
32
+ """Detects planted bugs based on agent actions and API responses."""
33
+
34
+ def __init__(self, task_id: str):
35
+ self.task_id = task_id
36
+ self._build_bug_registry()
37
+
38
+ def _build_bug_registry(self):
39
+ """Define all bugs with their detection logic."""
40
+ self.bugs: dict[str, Bug] = {}
41
+ self.detectors: dict[str, Callable] = {}
42
+
43
+ # === EASY BUGS ===
44
+
45
+ self._register_bug(
46
+ Bug("BUG_TASK_01", "easy",
47
+ "GET /tasks/{id} returns 200 with null for non-existent task",
48
+ "status_code",
49
+ owasp="API8:2023 Security Misconfiguration",
50
+ recommendation="Return 404 Not Found for non-existent resources"),
51
+ self._detect_null_response_for_missing_task,
52
+ )
53
+ self._register_bug(
54
+ Bug("BUG_TASK_02", "easy",
55
+ "POST /tasks with missing title returns 500 instead of 400/422",
56
+ "validation",
57
+ owasp="API8:2023 Security Misconfiguration",
58
+ recommendation="Validate required fields and return 400/422 with descriptive error"),
59
+ self._detect_missing_field_500,
60
+ )
61
+ self._register_bug(
62
+ Bug("BUG_TASK_03", "easy",
63
+ "GET /tasks?page=-1 returns 200 instead of 400",
64
+ "validation",
65
+ owasp="API8:2023 Security Misconfiguration",
66
+ recommendation="Validate pagination parameters: page >= 1, limit > 0"),
67
+ self._detect_negative_page,
68
+ )
69
+
70
+ # === MEDIUM BUGS ===
71
+
72
+ self._register_bug(
73
+ Bug("BUG_TASK_04", "medium",
74
+ "PUT /tasks/{id} accepts invalid email format for assignee_email",
75
+ "validation",
76
+ owasp="API8:2023 Security Misconfiguration",
77
+ recommendation="Validate email format with regex before accepting"),
78
+ self._detect_invalid_email_accepted,
79
+ )
80
+ self._register_bug(
81
+ Bug("BUG_TASK_05", "medium",
82
+ "DELETE /tasks/{id} returns 200 for non-existent task",
83
+ "status_code",
84
+ owasp="API8:2023 Security Misconfiguration",
85
+ recommendation="Check resource existence before deletion, return 404 if missing"),
86
+ self._detect_delete_nonexistent_200,
87
+ )
88
+ self._register_bug(
89
+ Bug("BUG_TASK_06", "medium",
90
+ "GET /tasks?limit=999999 has no pagination cap",
91
+ "validation",
92
+ owasp="API4:2023 Unrestricted Resource Consumption",
93
+ recommendation="Cap pagination limit at 100, reject values above maximum"),
94
+ self._detect_no_pagination_cap,
95
+ )
96
+ self._register_bug(
97
+ Bug("BUG_USER_01", "medium",
98
+ "POST /users accepts invalid email format",
99
+ "validation",
100
+ owasp="API8:2023 Security Misconfiguration",
101
+ recommendation="Validate email format server-side before creating user"),
102
+ self._detect_user_invalid_email,
103
+ )
104
+ self._register_bug(
105
+ Bug("BUG_USER_02", "medium",
106
+ "POST /users response exposes password hash",
107
+ "security",
108
+ owasp="API3:2023 Broken Object Property Level Authorization",
109
+ recommendation="Never return sensitive fields (password_hash) in API responses"),
110
+ self._detect_password_hash_exposed,
111
+ )
112
+ self._register_bug(
113
+ Bug("BUG_AUTH_02", "medium",
114
+ "Login with empty password succeeds",
115
+ "security",
116
+ owasp="API2:2023 Broken Authentication",
117
+ recommendation="Validate password is non-empty and verify against stored hash"),
118
+ self._detect_empty_password_login,
119
+ )
120
+
121
+ # === HARD BUGS ===
122
+
123
+ self._register_bug(
124
+ Bug("BUG_TASK_07", "hard",
125
+ "BOLA: User A can access User B's tasks without authorization check",
126
+ "security",
127
+ owasp="API1:2023 Broken Object Level Authorization",
128
+ recommendation="Verify resource ownership: check task.owner_id matches authenticated user"),
129
+ self._detect_bola,
130
+ )
131
+ self._register_bug(
132
+ Bug("BUG_TASK_08", "hard",
133
+ "POST /tasks with very long title (>5000 chars) causes 500",
134
+ "validation",
135
+ owasp="API4:2023 Unrestricted Resource Consumption",
136
+ recommendation="Add input length validation: title max 200 chars"),
137
+ self._detect_long_input_crash,
138
+ )
139
+ self._register_bug(
140
+ Bug("BUG_TASK_09", "hard",
141
+ "SQL injection payload in title is stored verbatim (content injection)",
142
+ "security",
143
+ owasp="API8:2023 Security Misconfiguration",
144
+ recommendation="Sanitize user input before storage, escape HTML/SQL special characters"),
145
+ self._detect_content_injection,
146
+ )
147
+ self._register_bug(
148
+ Bug("BUG_AUTH_01", "hard",
149
+ "Auth tokens not user-scoped: User A's token can modify User B's tasks",
150
+ "security",
151
+ owasp="API1:2023 Broken Object Level Authorization",
152
+ recommendation="Enforce ownership check on all write operations (PUT/DELETE)"),
153
+ self._detect_broken_auth,
154
+ )
155
+
156
+ def _register_bug(self, bug: Bug, detector: Callable):
157
+ self.bugs[bug.id] = bug
158
+ self.detectors[bug.id] = detector
159
+
160
+ def get_bugs_for_task(self) -> list[Bug]:
161
+ """Return bugs relevant to the current task."""
162
+ if self.task_id == "basic_validation":
163
+ return [self.bugs[bid] for bid in ["BUG_TASK_01", "BUG_TASK_02", "BUG_TASK_03"]]
164
+ elif self.task_id == "edge_cases":
165
+ return [
166
+ self.bugs[bid]
167
+ for bid in [
168
+ "BUG_TASK_01", "BUG_TASK_02", "BUG_TASK_03",
169
+ "BUG_TASK_04", "BUG_TASK_05", "BUG_TASK_06",
170
+ "BUG_USER_01", "BUG_USER_02", "BUG_AUTH_02",
171
+ ]
172
+ ]
173
+ else: # security_workflows
174
+ return list(self.bugs.values())
175
+
176
+ def check(
177
+ self,
178
+ method: str,
179
+ endpoint: str,
180
+ headers: dict,
181
+ query_params: dict,
182
+ body: Optional[dict],
183
+ expected_status: Optional[int],
184
+ response_status: int,
185
+ response_body: Any,
186
+ action_history: list[dict],
187
+ found_bugs: set[str],
188
+ ) -> Optional[BugDetection]:
189
+ """Check if this action/response reveals a bug.
190
+
191
+ Returns the first new bug detected, or None.
192
+ """
193
+ ctx = {
194
+ "method": method.upper(),
195
+ "endpoint": endpoint,
196
+ "headers": headers,
197
+ "query_params": query_params,
198
+ "body": body,
199
+ "expected_status": expected_status,
200
+ "response_status": response_status,
201
+ "response_body": response_body,
202
+ "action_history": action_history,
203
+ }
204
+
205
+ for bug_id, detector in self.detectors.items():
206
+ if bug_id in found_bugs:
207
+ continue
208
+ # Only check bugs relevant to this task
209
+ task_bugs = {b.id for b in self.get_bugs_for_task()}
210
+ if bug_id not in task_bugs:
211
+ continue
212
+ result = detector(ctx)
213
+ if result:
214
+ return BugDetection(bug=self.bugs[bug_id], evidence=result)
215
+
216
+ return None
217
+
218
+ # === DETECTION FUNCTIONS ===
219
+
220
+ def _detect_null_response_for_missing_task(self, ctx: dict) -> Optional[str]:
221
+ if (
222
+ ctx["method"] == "GET"
223
+ and re.match(r"^/tasks/\d+$", ctx["endpoint"])
224
+ and ctx["response_status"] == 200
225
+ and ctx["response_body"] is None
226
+ ):
227
+ task_id = ctx["endpoint"].split("/")[-1]
228
+ return f"GET /tasks/{task_id} returned 200 with null body β€” should be 404"
229
+ return None
230
+
231
+ def _detect_missing_field_500(self, ctx: dict) -> Optional[str]:
232
+ if (
233
+ ctx["method"] == "POST"
234
+ and ctx["endpoint"] == "/tasks"
235
+ and ctx["response_status"] == 500
236
+ and ctx["body"] is not None
237
+ and "title" not in ctx["body"]
238
+ ):
239
+ return "POST /tasks with missing 'title' returned 500 β€” should be 400 or 422"
240
+ return None
241
+
242
+ def _detect_negative_page(self, ctx: dict) -> Optional[str]:
243
+ if (
244
+ ctx["method"] == "GET"
245
+ and ctx["endpoint"] == "/tasks"
246
+ and ctx["query_params"].get("page") is not None
247
+ ):
248
+ page = ctx["query_params"]["page"]
249
+ try:
250
+ page = int(page)
251
+ except (ValueError, TypeError):
252
+ return None
253
+ if page < 1 and ctx["response_status"] == 200:
254
+ return f"GET /tasks?page={page} returned 200 β€” should be 400 for invalid page"
255
+ return None
256
+
257
+ def _detect_invalid_email_accepted(self, ctx: dict) -> Optional[str]:
258
+ if (
259
+ ctx["method"] == "PUT"
260
+ and re.match(r"^/tasks/\d+$", ctx["endpoint"])
261
+ and ctx["body"]
262
+ and "assignee_email" in ctx["body"]
263
+ and ctx["response_status"] in (200, 201)
264
+ ):
265
+ email = ctx["body"]["assignee_email"]
266
+ if email and not re.match(r"^[^@]+@[^@]+\.[^@]+$", email):
267
+ return f"PUT accepted invalid email '{email}' without validation"
268
+ return None
269
+
270
+ def _detect_delete_nonexistent_200(self, ctx: dict) -> Optional[str]:
271
+ if (
272
+ ctx["method"] == "DELETE"
273
+ and re.match(r"^/tasks/\d+$", ctx["endpoint"])
274
+ and ctx["response_status"] == 200
275
+ ):
276
+ task_id = int(ctx["endpoint"].split("/")[-1])
277
+ # Check if this task was never created (ID > 1000 is a safe bet for non-existent)
278
+ if task_id > 100:
279
+ return f"DELETE /tasks/{task_id} returned 200 for non-existent task β€” should be 404"
280
+ return None
281
+
282
+ def _detect_no_pagination_cap(self, ctx: dict) -> Optional[str]:
283
+ if (
284
+ ctx["method"] == "GET"
285
+ and ctx["endpoint"] == "/tasks"
286
+ and ctx["response_status"] == 200
287
+ ):
288
+ limit = ctx["query_params"].get("limit")
289
+ if limit is not None:
290
+ try:
291
+ limit = int(limit)
292
+ except (ValueError, TypeError):
293
+ return None
294
+ if limit > 1000:
295
+ return f"GET /tasks?limit={limit} accepted without pagination cap β€” potential DoS"
296
+ return None
297
+
298
+ def _detect_user_invalid_email(self, ctx: dict) -> Optional[str]:
299
+ if (
300
+ ctx["method"] == "POST"
301
+ and ctx["endpoint"] == "/users"
302
+ and ctx["body"]
303
+ and "email" in ctx["body"]
304
+ and ctx["response_status"] == 201
305
+ ):
306
+ email = ctx["body"]["email"]
307
+ if email and not re.match(r"^[^@]+@[^@]+\.[^@]+$", email):
308
+ return f"POST /users accepted invalid email '{email}'"
309
+ return None
310
+
311
+ def _detect_password_hash_exposed(self, ctx: dict) -> Optional[str]:
312
+ if (
313
+ ctx["method"] == "POST"
314
+ and ctx["endpoint"] == "/users"
315
+ and ctx["response_status"] == 201
316
+ and isinstance(ctx["response_body"], dict)
317
+ ):
318
+ if "password_hash" in ctx["response_body"]:
319
+ return "POST /users response exposes password_hash field β€” security vulnerability"
320
+ return None
321
+
322
+ def _detect_empty_password_login(self, ctx: dict) -> Optional[str]:
323
+ if (
324
+ ctx["method"] == "POST"
325
+ and ctx["endpoint"] == "/auth/login"
326
+ and ctx["body"]
327
+ and ctx["response_status"] == 200
328
+ ):
329
+ password = ctx["body"].get("password", "NOTEMPTY")
330
+ if password == "" or password is None:
331
+ return "Login with empty password succeeded β€” authentication bypass"
332
+ return None
333
+
334
+ def _detect_bola(self, ctx: dict) -> Optional[str]:
335
+ """Detect if agent tested cross-user resource access."""
336
+ if (
337
+ ctx["method"] == "GET"
338
+ and re.match(r"^/tasks/\d+$", ctx["endpoint"])
339
+ and ctx["response_status"] == 200
340
+ and ctx["response_body"] is not None
341
+ and isinstance(ctx["response_body"], dict)
342
+ and ctx["headers"].get("Authorization")
343
+ ):
344
+ # Check if the agent logged in as a different user and accessed another's task
345
+ for prev in reversed(ctx["action_history"]):
346
+ if (
347
+ prev.get("method") == "POST"
348
+ and prev.get("endpoint") == "/auth/login"
349
+ and prev.get("response_status") == 200
350
+ and isinstance(prev.get("response_body"), dict)
351
+ ):
352
+ login_user_id = prev["response_body"].get("user_id")
353
+ task_owner_id = ctx["response_body"].get("owner_id")
354
+ if (
355
+ login_user_id is not None
356
+ and task_owner_id is not None
357
+ and login_user_id != task_owner_id
358
+ ):
359
+ return (
360
+ f"User {login_user_id} accessed task owned by user {task_owner_id} "
361
+ f"β€” BOLA/IDOR vulnerability (no authorization check)"
362
+ )
363
+ return None
364
+
365
+ def _detect_long_input_crash(self, ctx: dict) -> Optional[str]:
366
+ if (
367
+ ctx["method"] == "POST"
368
+ and ctx["endpoint"] == "/tasks"
369
+ and ctx["body"]
370
+ and ctx["response_status"] == 500
371
+ ):
372
+ title = ctx["body"].get("title", "")
373
+ if isinstance(title, str) and len(title) > 5000:
374
+ return f"POST /tasks with title length {len(title)} caused 500 β€” no input length validation"
375
+ return None
376
+
377
+ def _detect_content_injection(self, ctx: dict) -> Optional[str]:
378
+ if (
379
+ ctx["method"] == "POST"
380
+ and ctx["endpoint"] == "/tasks"
381
+ and ctx["body"]
382
+ and ctx["response_status"] == 201
383
+ and isinstance(ctx["response_body"], dict)
384
+ ):
385
+ title = ctx["body"].get("title", "")
386
+ injection_patterns = [
387
+ "DROP TABLE", "DELETE FROM", "<script>", "javascript:",
388
+ "'; --", "\" OR 1=1", "UNION SELECT",
389
+ ]
390
+ for pattern in injection_patterns:
391
+ if pattern.lower() in str(title).lower():
392
+ stored_title = ctx["response_body"].get("title", "")
393
+ if pattern.lower() in str(stored_title).lower():
394
+ return (
395
+ f"Injection payload '{pattern}' in title was stored verbatim "
396
+ f"β€” no input sanitization (content injection)"
397
+ )
398
+ return None
399
+
400
+ def _detect_broken_auth(self, ctx: dict) -> Optional[str]:
401
+ """Detect if agent successfully modified another user's task with their own token."""
402
+ if (
403
+ ctx["method"] in ("PUT", "DELETE")
404
+ and re.match(r"^/tasks/\d+$", ctx["endpoint"])
405
+ and ctx["response_status"] == 200
406
+ and ctx["headers"].get("Authorization")
407
+ ):
408
+ for prev in reversed(ctx["action_history"]):
409
+ if (
410
+ prev.get("method") == "POST"
411
+ and prev.get("endpoint") == "/auth/login"
412
+ and prev.get("response_status") == 200
413
+ and isinstance(prev.get("response_body"), dict)
414
+ ):
415
+ login_user_id = prev["response_body"].get("user_id")
416
+ # Check if the task belonged to a different user
417
+ task_id = int(ctx["endpoint"].split("/")[-1])
418
+ if isinstance(ctx["response_body"], dict):
419
+ task_owner = ctx["response_body"].get("owner_id")
420
+ if (
421
+ login_user_id is not None
422
+ and task_owner is not None
423
+ and login_user_id != task_owner
424
+ ):
425
+ return (
426
+ f"User {login_user_id}'s token modified task owned by user {task_owner} "
427
+ f"β€” broken authorization"
428
+ )
429
+ break
430
+ return None
server/buggy_api/__init__.py ADDED
File without changes
server/buggy_api/database.py ADDED
@@ -0,0 +1,209 @@
+ """
+ In-memory SQLite database for the buggy API.
+ Supports reset between episodes with DOMAIN RANDOMIZATION —
+ each seed produces different users, tasks, and data distributions
+ so that every training episode is unique.
+ """
+
+ import random
+ import sqlite3
+ import threading
+ from contextlib import contextmanager
+
+ # Name pools for randomized seed data
+ FIRST_NAMES = [
+     "alice", "bob", "charlie", "diana", "ethan", "fiona", "george", "hannah",
+     "ivan", "julia", "kevin", "luna", "mike", "nina", "oscar", "priya",
+     "quinn", "ravi", "sara", "tom", "uma", "victor", "wendy", "xander",
+ ]
+ DOMAINS = ["example.com", "company.org", "startup.io", "work.dev", "test.net"]
+ TASK_TITLES = [
+     "Setup CI/CD pipeline", "Write unit tests", "Fix login page CSS",
+     "Database migration", "API documentation", "Refactor auth module",
+     "Add rate limiting", "Setup monitoring", "Fix memory leak",
+     "Update dependencies", "Add logging middleware", "Create admin panel",
+     "Implement caching", "Fix CORS issues", "Add input validation",
+     "Setup Docker compose", "Write integration tests", "Fix date parsing bug",
+     "Add search functionality", "Implement pagination", "Setup SSL certs",
+     "Add webhook support", "Fix timezone handling", "Create backup script",
+     "Optimize database queries", "Add email notifications", "Fix file upload",
+     "Implement user roles", "Add audit logging", "Setup load balancer",
+ ]
+ TASK_DESCRIPTIONS = [
+     "Configure GitHub Actions for automated deployment",
+     "Add tests for the auth module endpoints",
+     "Button alignment issue on mobile devices",
+     "Migrate from SQLite to PostgreSQL",
+     "Document all REST endpoints with examples",
+     "Break down the monolithic auth into smaller services",
+     "Prevent API abuse with request throttling",
+     "Setup Grafana dashboards for key metrics",
+     "Memory usage grows unbounded after 1000 requests",
+     "Several packages have critical CVEs",
+     "Add structured JSON logging to all routes",
+     "Build an admin dashboard for user management",
+     "Add Redis caching layer for frequent queries",
+     "Frontend gets blocked by CORS policy",
+     "Sanitize user inputs to prevent injection",
+ ]
+ STATUSES = ["pending", "in_progress", "done"]
+ PRIORITIES = ["low", "medium", "high"]
+
+
+ class Database:
+     """Thread-safe in-memory SQLite database that can be reset between episodes.
+
+     When a seed is provided, the database is populated with deterministically
+     randomized data — different users, tasks, and distributions each time.
+     This prevents the agent from memorizing a single fixed dataset.
+     """
+
+     def __init__(self, seed: int | None = None):
+         self._lock = threading.Lock()
+         self._conn: sqlite3.Connection | None = None
+         self._seed = seed
+         self.initialize()
+
+     def initialize(self):
+         """Create a fresh database with schema and seed data."""
+         with self._lock:
+             if self._conn:
+                 self._conn.close()
+             self._conn = sqlite3.connect(":memory:", check_same_thread=False)
+             self._conn.row_factory = sqlite3.Row
+             self._conn.execute("PRAGMA journal_mode=WAL")
+             self._create_schema()
+             self._seed_data()
+
+     def _create_schema(self):
+         cursor = self._conn.cursor()
+         cursor.executescript("""
+             CREATE TABLE IF NOT EXISTS users (
+                 id INTEGER PRIMARY KEY AUTOINCREMENT,
+                 username TEXT UNIQUE NOT NULL,
+                 email TEXT NOT NULL,
+                 password_hash TEXT NOT NULL,
+                 role TEXT DEFAULT 'user',
+                 created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+             );
+
+             CREATE TABLE IF NOT EXISTS tasks (
+                 id INTEGER PRIMARY KEY AUTOINCREMENT,
+                 title TEXT NOT NULL,
+                 description TEXT DEFAULT '',
+                 status TEXT DEFAULT 'pending',
+                 priority TEXT DEFAULT 'medium',
+                 assignee_email TEXT DEFAULT '',
+                 owner_id INTEGER NOT NULL,
+                 created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+                 updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+                 FOREIGN KEY (owner_id) REFERENCES users(id)
+             );
+
+             CREATE TABLE IF NOT EXISTS auth_tokens (
+                 token TEXT PRIMARY KEY,
+                 user_id INTEGER NOT NULL,
+                 created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+                 expires_at TIMESTAMP,
+                 FOREIGN KEY (user_id) REFERENCES users(id)
+             );
+         """)
+         self._conn.commit()
+
+     def _seed_data(self):
+         """Seed the database with randomized data based on the seed.
+
+         With seed=None, uses a fixed default dataset (for manual testing).
+         With a seed, generates random users/tasks so every episode differs.
+         """
+         rng = random.Random(self._seed)
+         cursor = self._conn.cursor()
+
+         if self._seed is None:
+             # Default fixed data for manual testing / Gradio UI
+             cursor.executescript("""
+                 INSERT INTO users (username, email, password_hash, role) VALUES
+                 ('alice', 'alice@example.com', 'hashed_password123', 'admin'),
+                 ('bob', 'bob@example.com', 'hashed_password123', 'user'),
+                 ('charlie', 'charlie@example.com', 'hashed_password123', 'user');
+
+                 INSERT INTO tasks (title, description, status, priority, assignee_email, owner_id) VALUES
+                 ('Setup CI/CD pipeline', 'Configure GitHub Actions', 'in_progress', 'high', 'alice@example.com', 1),
+                 ('Write unit tests', 'Add tests for auth module', 'pending', 'medium', 'bob@example.com', 2),
+                 ('Fix login page CSS', 'Button alignment issue', 'done', 'low', 'charlie@example.com', 3),
+                 ('Database migration', 'Migrate to PostgreSQL', 'pending', 'high', 'alice@example.com', 1),
+                 ('API documentation', 'Document all endpoints', 'in_progress', 'medium', 'bob@example.com', 2);
+             """)
+         else:
+             # Randomized data — different every episode
+             # Pick 3-5 users from the name pool
+             num_users = rng.randint(3, 5)
+             user_names = rng.sample(FIRST_NAMES, num_users)
+             domain = rng.choice(DOMAINS)
+
+             # First user is always admin, rest are regular users
+             for i, name in enumerate(user_names):
+                 role = "admin" if i == 0 else "user"
+                 email = f"{name}@{domain}"
+                 cursor.execute(
+                     "INSERT INTO users (username, email, password_hash, role) VALUES (?, ?, ?, ?)",
+                     (name, email, f"hashed_password_{rng.randint(100, 999)}", role),
+                 )
+
+             # Pick 4-8 tasks with random assignments
+             num_tasks = rng.randint(4, 8)
+             task_titles = rng.sample(TASK_TITLES, min(num_tasks, len(TASK_TITLES)))
+             task_descs = rng.sample(TASK_DESCRIPTIONS, min(num_tasks, len(TASK_DESCRIPTIONS)))
+
+             for i in range(num_tasks):
+                 owner_id = rng.randint(1, num_users)
+                 assignee_id = rng.randint(1, num_users)
+                 assignee_email = f"{user_names[assignee_id - 1]}@{domain}"
+                 cursor.execute(
+                     "INSERT INTO tasks (title, description, status, priority, assignee_email, owner_id) VALUES (?, ?, ?, ?, ?, ?)",
+                     (
+                         task_titles[i % len(task_titles)],
+                         task_descs[i] if i < len(task_descs) else "",
+                         rng.choice(STATUSES),
+                         rng.choice(PRIORITIES),
+                         assignee_email,
+                         owner_id,
+                     ),
+                 )
+
+         self._conn.commit()
+
+     @property
+     def user_names(self) -> list[str]:
+         """Get usernames in the database (for the agent's observation)."""
+         rows = self.execute("SELECT username FROM users ORDER BY id")
+         return [r["username"] for r in rows]
+
+     @contextmanager
+     def get_cursor(self):
+         with self._lock:
+             cursor = self._conn.cursor()
+             try:
+                 yield cursor
+                 self._conn.commit()
+             except Exception:
+                 self._conn.rollback()
+                 raise
+
+     def execute(self, query: str, params: tuple = ()) -> list[dict]:
+         with self.get_cursor() as cursor:
+             cursor.execute(query, params)
+             if cursor.description:
+                 columns = [col[0] for col in cursor.description]
+                 return [dict(zip(columns, row)) for row in cursor.fetchall()]
+             return []
+
+     def execute_insert(self, query: str, params: tuple = ()) -> int:
+         with self.get_cursor() as cursor:
+             cursor.execute(query, params)
+             return cursor.lastrowid
+
+     def execute_update(self, query: str, params: tuple = ()) -> int:
+         with self.get_cursor() as cursor:
+             cursor.execute(query, params)
+             return cursor.rowcount
server/buggy_api/main.py ADDED
@@ -0,0 +1,91 @@
+ """
+ The deliberately buggy REST API — a task management system.
+
+ This API is the system-under-test. It has intentionally planted bugs at varying
+ difficulty levels that the AI agent must discover through intelligent testing.
+
+ The API runs in-process via Starlette's TestClient (no separate port needed).
+ """
+
+ import logging
+ from typing import Optional
+
+ from fastapi import FastAPI, Request, Header
+ from fastapi.responses import JSONResponse
+
+ from .database import Database
+ from .routes import tasks as tasks_routes
+ from .routes import users as users_routes
+ from .routes import auth as auth_routes
+ from .models import TaskCreate
+
+ logger = logging.getLogger(__name__)
+
+
+ def create_buggy_api(db: Database) -> FastAPI:
+     """Create a fresh buggy API instance wired to the given database."""
+     api = FastAPI(
+         title="TaskTracker API",
+         description="A task management API (with bugs)",
+         version="1.0.0",
+     )
+
+     # Wire database into route modules
+     tasks_routes.set_db(db)
+     users_routes.set_db(db)
+     auth_routes.set_db(db)
+
+     # Include standard routes
+     api.include_router(tasks_routes.router)
+     api.include_router(users_routes.router)
+     api.include_router(auth_routes.router)
+
+     # BUG_TASK_02 + BUG_TASK_08: Raw POST /tasks handler that doesn't use Pydantic validation.
+     # This allows missing fields and overly long inputs to cause 500 errors.
+     @api.post("/tasks", status_code=201)
+     async def create_task_raw(
+         request: Request,
+         authorization: Optional[str] = Header(None),
+     ):
+         try:
+             body = await request.json()
+         except Exception:
+             # BUG_TASK_02: Returns 500 on malformed/empty body instead of 400
+             raise Exception("Failed to parse request body")
+
+         if not isinstance(body, dict):
+             raise Exception("Invalid body format")
+
+         title = body.get("title")
+
+         # BUG_TASK_02: No check for missing title — raises and becomes a 500 below
+         if title is None:
+             # This SHOULD return 400, but we let it fall through to cause 500
+             # Simulate an internal error from missing required field
+             raise Exception("Internal error: title is required but was None")
+
+         # BUG_TASK_08: No length validation on title
+         if len(title) > 5000:
+             # Simulate a database error from overly long input
+             raise Exception(f"Database error: value too long for column 'title' (length={len(title)})")
+
+         task_data = TaskCreate(
+             title=title,
+             description=body.get("description", ""),
+             status=body.get("status", "pending"),
+             priority=body.get("priority", "medium"),
+             assignee_email=body.get("assignee_email", ""),
+         )
+         return tasks_routes.create_task_internal(task_data, authorization)
+
+     # Global error handler — returns 500 for unhandled exceptions
+     @api.exception_handler(Exception)
+     async def global_exception_handler(request: Request, exc: Exception):
+         logger.error(f"Unhandled error: {exc}")
+         return JSONResponse(
+             status_code=500,
+             content={"error": "Internal Server Error", "detail": str(exc)},
+         )
+
+     return api
server/buggy_api/models.py ADDED
@@ -0,0 +1,64 @@
+ """Pydantic models for the buggy API request/response schemas."""
+
+ from typing import Optional
+
+ from pydantic import BaseModel
+
+
+ class UserCreate(BaseModel):
+     username: str
+     email: str
+     password: str
+     role: str = "user"
+
+
+ class UserResponse(BaseModel):
+     id: int
+     username: str
+     email: str
+     role: str
+     created_at: str
+
+
+ class TaskCreate(BaseModel):
+     title: str
+     description: str = ""
+     status: str = "pending"
+     priority: str = "medium"
+     assignee_email: str = ""
+
+
+ class TaskUpdate(BaseModel):
+     title: Optional[str] = None
+     description: Optional[str] = None
+     status: Optional[str] = None
+     priority: Optional[str] = None
+     assignee_email: Optional[str] = None
+
+
+ class TaskResponse(BaseModel):
+     id: int
+     title: str
+     description: str
+     status: str
+     priority: str
+     assignee_email: str
+     owner_id: int
+     created_at: str
+     updated_at: str
+
+
+ class LoginRequest(BaseModel):
+     username: str
+     password: str
+
+
+ class LoginResponse(BaseModel):
+     token: str
+     user_id: int
+     username: str
+     role: str
+
+
+ class ErrorResponse(BaseModel):
+     error: str
+     detail: str = ""
server/buggy_api/routes/__init__.py ADDED
File without changes
server/buggy_api/routes/auth.py ADDED
@@ -0,0 +1,82 @@
+ """
+ Authentication routes with planted bugs.
+
+ BUGS PLANTED:
+ - BUG_AUTH_01 (hard): Auth tokens are not user-scoped — any valid token works for any user's resources
+ - BUG_AUTH_02 (medium): Login with empty password succeeds (missing validation)
+ """
+
+ import uuid
+ from datetime import datetime, timedelta
+ from typing import Optional
+
+ from fastapi import APIRouter, Header, HTTPException
+
+ from ..database import Database
+ from ..models import LoginRequest, LoginResponse
+
+ router = APIRouter(prefix="/auth", tags=["auth"])
+
+ _db: Database | None = None
+
+
+ def set_db(db: Database):
+     global _db
+     _db = db
+
+
+ def get_db() -> Database:
+     return _db
+
+
+ def get_current_user(authorization: Optional[str] = Header(None)) -> dict | None:
+     """Extract user from auth token.
+
+     BUG_AUTH_01: Returns the token's user but doesn't enforce ownership anywhere.
+     The routes that use this don't check if the resource belongs to the user.
+     """
+     if not authorization:
+         return None
+     token = authorization.replace("Bearer ", "")
+     db = get_db()
+     rows = db.execute(
+         "SELECT u.id, u.username, u.role FROM auth_tokens t JOIN users u ON t.user_id = u.id WHERE t.token = ?",
+         (token,),
+     )
+     if not rows:
+         return None
+     return rows[0]
+
+
+ @router.post("/login", response_model=LoginResponse)
+ def login(req: LoginRequest):
+     db = get_db()
+
+     # BUG_AUTH_02: Empty password check is missing — an empty password still logs in.
+     # Should validate: if not req.password: raise HTTPException(400, ...)
+     rows = db.execute(
+         "SELECT id, username, role, password_hash FROM users WHERE username = ?",
+         (req.username,),
+     )
+     if not rows:
+         raise HTTPException(status_code=401, detail="Invalid credentials")
+
+     user = rows[0]
+     # BUG_AUTH_02 continued: Only the username is checked — the password hash is never verified.
+     # Any password (including the empty string) works as long as the username exists.
+
+     token = str(uuid.uuid4())
+     expires = datetime.utcnow() + timedelta(hours=24)
+     db.execute_insert(
+         "INSERT INTO auth_tokens (token, user_id, expires_at) VALUES (?, ?, ?)",
+         (token, user["id"], expires.isoformat()),
+     )
+
+     return LoginResponse(
+         token=token,
+         user_id=user["id"],
+         username=user["username"],
+         role=user["role"],
+     )
server/buggy_api/routes/tasks.py ADDED
@@ -0,0 +1,210 @@
+ """
+ Task CRUD routes with planted bugs.
+
+ BUGS PLANTED:
+ - BUG_TASK_01 (easy): GET /tasks/{id} returns 200 with null body for a non-existent task (should be 404)
+ - BUG_TASK_02 (easy): POST /tasks with the required 'title' missing returns 500 instead of 400/422
+ - BUG_TASK_03 (easy): GET /tasks?page=-1 returns 200 instead of 400
+ - BUG_TASK_04 (medium): PUT /tasks/{id} doesn't validate assignee_email format
+ - BUG_TASK_05 (medium): DELETE /tasks/{id} returns 200 even for a non-existent task (should be 404)
+ - BUG_TASK_06 (medium): GET /tasks?limit=999999 has no pagination cap (potential DoS)
+ - BUG_TASK_07 (hard): GET /tasks/{id} on another user's task returns its data (BOLA/IDOR vulnerability)
+ - BUG_TASK_08 (hard): POST /tasks with a very long title (>5000 chars) causes 500 (no input length validation)
+ - BUG_TASK_09 (hard): POST /tasks with a SQL injection payload in title doesn't sanitize (parameterized
+   queries prevent actual injection, but the input is stored verbatim — a content injection)
+ - BUG_TASK_10 (hard): No rate limiting — rapid sequential requests all succeed
+ """
+
+ from typing import Optional
+
+ from fastapi import APIRouter, Header, HTTPException, Query
+
+ from ..database import Database
+ from ..models import TaskCreate, TaskUpdate
+
+ router = APIRouter(prefix="/tasks", tags=["tasks"])
+
+ _db: Database | None = None
+
+ # Simple in-memory cache for BUG demonstration
+ _cache: dict[int, dict] = {}
+
+
+ def set_db(db: Database):
+     global _db, _cache
+     _db = db
+     _cache = {}
+
+
+ def get_db() -> Database:
+     return _db
+
+
+ @router.get("")
+ def list_tasks(
+     status: Optional[str] = Query(None, description="Filter by status"),
+     priority: Optional[str] = Query(None, description="Filter by priority"),
+     sort: Optional[str] = Query(None, description="Sort field"),
+     page: Optional[int] = Query(None, description="Page number"),
+     limit: Optional[int] = Query(None, description="Items per page"),
+     authorization: Optional[str] = Header(None),
+ ):
+     db = get_db()
+
+     # BUG_TASK_03: No validation for negative page numbers
+     # Should check: if page is not None and page < 1: raise HTTPException(400, ...)
+
+     # BUG_TASK_06: No cap on limit — agent can request limit=999999
+     # Should cap at e.g. 100
+
+     query = "SELECT * FROM tasks WHERE 1=1"
+     params = []
+
+     if status:
+         query += " AND status = ?"
+         params.append(status)
+     if priority:
+         query += " AND priority = ?"
+         params.append(priority)
+
+     if sort:
+         allowed_sorts = ["created_at", "updated_at", "title", "priority", "status"]
+         if sort in allowed_sorts:
+             query += f" ORDER BY {sort}"
+         else:
+             query += " ORDER BY created_at"
+     else:
+         query += " ORDER BY created_at DESC"
+
+     if limit is not None:
+         # BUG_TASK_06: No upper bound check on limit
+         query += " LIMIT ?"
+         params.append(limit)
+     else:
+         query += " LIMIT 20"
+
+     if page is not None and limit is not None:
+         # BUG_TASK_03: Allows negative offset — page=-1 with limit=10 gives offset=-20
+         offset = (page - 1) * limit
+         query += " OFFSET ?"
+         params.append(offset)
+
+     rows = db.execute(query, tuple(params))
+     return rows
+
+
+ @router.get("/{task_id}")
+ def get_task(
+     task_id: int,
+     authorization: Optional[str] = Header(None),
+ ):
+     db = get_db()
+
+     # Check cache first (used later for stale cache bug)
+     if task_id in _cache:
+         return _cache[task_id]
+
+     rows = db.execute("SELECT * FROM tasks WHERE id = ?", (task_id,))
+
+     # BUG_TASK_01: Returns 200 with null instead of 404
+     if not rows:
+         return None  # Should be: raise HTTPException(status_code=404, detail="Task not found")
+
+     task = rows[0]
+
+     # BUG_TASK_07: No ownership check — any authenticated user can see any task
+     # Should check: if user and task["owner_id"] != user["id"]: raise HTTPException(403)
+
+     # Cache the result
+     _cache[task_id] = task
+     return task
+
+
+ @router.post("/create", status_code=201)
+ def create_task_internal(
+     task: TaskCreate,
+     authorization: Optional[str] = Header(None),
+ ):
+     """Internal create — used by the raw handler after parsing."""
+     db = get_db()
+
+     # BUG_TASK_08: No title length validation
+     # Should check: if len(task.title) > 200: raise HTTPException(400, ...)
+
+     # BUG_TASK_09: No content sanitization — SQL injection payloads stored verbatim.
+     # Parameterized queries prevent actual SQL injection, but the content
+     # is stored and returned as-is, which is a content injection / XSS vector.
+
+     # Determine owner — default to user 1 if no auth
+     owner_id = 1
+     if authorization:
+         token = authorization.replace("Bearer ", "")
+         token_rows = db.execute(
+             "SELECT user_id FROM auth_tokens WHERE token = ?", (token,)
+         )
+         if token_rows:
+             owner_id = token_rows[0]["user_id"]
+
+     task_id = db.execute_insert(
+         "INSERT INTO tasks (title, description, status, priority, assignee_email, owner_id) VALUES (?, ?, ?, ?, ?, ?)",
+         (task.title, task.description, task.status, task.priority, task.assignee_email, owner_id),
+     )
+
+     rows = db.execute("SELECT * FROM tasks WHERE id = ?", (task_id,))
+     result = rows[0]
+     _cache[task_id] = result
+     return result
+
+
+ @router.put("/{task_id}")
+ def update_task(
+     task_id: int,
+     task: TaskUpdate,
+     authorization: Optional[str] = Header(None),
+ ):
+     db = get_db()
+
+     existing = db.execute("SELECT * FROM tasks WHERE id = ?", (task_id,))
+     if not existing:
+         raise HTTPException(status_code=404, detail="Task not found")
+
+     # BUG_TASK_04: No email format validation on assignee_email
+     # Should validate if task.assignee_email is provided
+
+     # BUG_TASK_07: No ownership check on update either
+     updates = []
+     params = []
+     for field_name in ["title", "description", "status", "priority", "assignee_email"]:
+         value = getattr(task, field_name, None)
+         if value is not None:
+             updates.append(f"{field_name} = ?")
+             params.append(value)
+
+     if updates:
+         updates.append("updated_at = CURRENT_TIMESTAMP")
+         params.append(task_id)
+         db.execute_update(
+             f"UPDATE tasks SET {', '.join(updates)} WHERE id = ?",
+             tuple(params),
+         )
+
+     rows = db.execute("SELECT * FROM tasks WHERE id = ?", (task_id,))
+     result = rows[0]
+     _cache[task_id] = result
+     return result
+
+
+ @router.delete("/{task_id}")
+ def delete_task(
+     task_id: int,
+     authorization: Optional[str] = Header(None),
+ ):
+     db = get_db()
+
+     # BUG_TASK_05: No existence check — returns 200 even for non-existent tasks
+     # Should check existence first and return 404
+     db.execute_update("DELETE FROM tasks WHERE id = ?", (task_id,))
+
+     # Note: cache is NOT cleared — this enables stale cache detection
+     # (BUG_TASK_01 variant: deleted task still returned from cache)
+
+     return {"message": "Task deleted", "id": task_id}
server/buggy_api/routes/users.py ADDED
@@ -0,0 +1,63 @@
+ """
+ User management routes with planted bugs.
+
+ BUGS PLANTED:
+ - BUG_USER_01 (medium): POST /users doesn't validate email format
+ - BUG_USER_02 (medium): POST /users exposes the password hash in the response
+ """
+
+ from fastapi import APIRouter, HTTPException
+
+ from ..database import Database
+ from ..models import UserCreate
+
+ router = APIRouter(prefix="/users", tags=["users"])
+
+ _db: Database | None = None
+
+
+ def set_db(db: Database):
+     global _db
+     _db = db
+
+
+ def get_db() -> Database:
+     return _db
+
+
+ @router.get("")
+ def list_users():
+     db = get_db()
+     rows = db.execute("SELECT id, username, email, role, created_at FROM users")
+     return rows
+
+
+ @router.get("/{user_id}")
+ def get_user(user_id: int):
+     db = get_db()
+     rows = db.execute("SELECT id, username, email, role, created_at FROM users WHERE id = ?", (user_id,))
+     if not rows:
+         raise HTTPException(status_code=404, detail="User not found")
+     return rows[0]
+
+
+ @router.post("", status_code=201)
+ def create_user(user: UserCreate):
+     db = get_db()
+
+     # BUG_USER_01: No email format validation — accepts "not-an-email" or empty string
+     # Should validate email with a regex or pydantic EmailStr
+
+     # Check username uniqueness
+     existing = db.execute("SELECT id FROM users WHERE username = ?", (user.username,))
+     if existing:
+         raise HTTPException(status_code=409, detail="Username already exists")
+
+     user_id = db.execute_insert(
+         "INSERT INTO users (username, email, password_hash, role) VALUES (?, ?, ?, ?)",
+         (user.username, user.email, f"hashed_{user.password}", user.role),
+     )
+
+     # BUG_USER_02: Response includes the password_hash field (SELECT *)
+     rows = db.execute("SELECT * FROM users WHERE id = ?", (user_id,))
+     return rows[0]
server/environment.py ADDED
@@ -0,0 +1,438 @@
+ """
+ OpenEnv Environment for API Integration Testing.
+
+ The agent interacts with a deliberately buggy REST API, discovering endpoints,
+ crafting requests, and finding bugs. Rewards are multi-signal: coverage,
+ validity, bug discovery, and exploration.
+ """
+
+ import json
+ import logging
+ import time
+ from typing import Optional
+
+ from fastapi.testclient import TestClient
+ from openenv.core.env_server.interfaces import Environment
+
+ try:
+     from ..models import APITestAction, APITestObservation, APITestState
+ except ImportError:
+     from models import APITestAction, APITestObservation, APITestState
+
+ from .buggy_api.database import Database
+ from .buggy_api.main import create_buggy_api
+ from .bug_detector import BugDetector
+ from .reward import RewardComputer
+ from .graders import TaskGrader, generate_bug_report
+
+ logger = logging.getLogger(__name__)
+
+ # Task definitions
+ TASKS = {
+     "basic_validation": {
+         "id": "basic_validation",
+         "description": (
+             "Test all CRUD endpoints with valid inputs and verify correct status codes. "
+             "Find basic bugs like wrong status codes and missing field handling. "
+             "Available endpoints: GET /tasks, POST /tasks, GET /tasks/{id}, PUT /tasks/{id}, "
+             "DELETE /tasks/{id}, GET /users, POST /users, POST /auth/login. "
+             "Try different methods on each endpoint and verify responses match the expected behavior."
+         ),
+         "difficulty": "easy",
+         "max_steps": 25,
+         "total_bugs": 3,
+     },
+     "edge_cases": {
+         "id": "edge_cases",
+         "description": (
+             "Test boundary conditions, invalid inputs, and error responses. "
+             "Send missing fields, wrong types, negative page numbers, huge limits. "
+             "Test with non-existent resource IDs (e.g., /tasks/999999). "
+             "Chain operations: create a resource, then read/update/delete it. "
+             "Find bugs in input validation, pagination, and error handling."
+         ),
+         "difficulty": "medium",
+         "max_steps": 35,
+         "total_bugs": 9,
+     },
+     "security_workflows": {
+         "id": "security_workflows",
+         "description": (
+             "Discover authorization flaws, injection vulnerabilities, and workflow bugs. "
+             "Login as different users (alice/password, bob/password, charlie/password) and "
+             "try accessing each other's resources. Test SQL injection patterns in input fields. "
+             "Execute multi-step workflows: create -> modify -> verify -> delete -> re-fetch. "
+             "Check if auth tokens properly scope access. Test with very long inputs."
+         ),
+         "difficulty": "hard",
+         "max_steps": 45,
+         "total_bugs": 13,
+     },
+ }
+
+ # OpenAPI-like spec for the agent
+ API_SPEC = [
+     {
+         "method": "GET",
+         "path": "/tasks",
+         "summary": "List all tasks. Supports filtering by status, priority; pagination with page & limit; sorting with sort.",
+         "parameters": [
+             {"name": "status", "in": "query", "type": "string", "enum": ["pending", "in_progress", "done"]},
+             {"name": "priority", "in": "query", "type": "string", "enum": ["low", "medium", "high"]},
+             {"name": "sort", "in": "query", "type": "string", "enum": ["created_at", "updated_at", "title"]},
+             {"name": "page", "in": "query", "type": "integer"},
+             {"name": "limit", "in": "query", "type": "integer"},
+         ],
+     },
+     {
+         "method": "POST",
+         "path": "/tasks",
+         "summary": "Create a new task. Requires 'title' field. Optional: description, status, priority, assignee_email.",
+         "request_body": {
+             "required": ["title"],
+             "properties": {
+                 "title": {"type": "string"},
+                 "description": {"type": "string"},
+                 "status": {"type": "string", "enum": ["pending", "in_progress", "done"]},
+                 "priority": {"type": "string", "enum": ["low", "medium", "high"]},
+                 "assignee_email": {"type": "string", "format": "email"},
+             },
+         },
+     },
+     {
+         "method": "GET",
+         "path": "/tasks/{id}",
+         "summary": "Get a specific task by ID.",
+         "parameters": [{"name": "id", "in": "path", "type": "integer", "required": True}],
+     },
+     {
+         "method": "PUT",
+         "path": "/tasks/{id}",
+         "summary": "Update a task. All fields optional.",
+         "parameters": [{"name": "id", "in": "path", "type": "integer", "required": True}],
+         "request_body": {
+             "properties": {
+                 "title": {"type": "string"},
+                 "description": {"type": "string"},
+                 "status": {"type": "string"},
+                 "priority": {"type": "string"},
+                 "assignee_email": {"type": "string", "format": "email"},
+             },
+         },
+     },
+     {
+         "method": "DELETE",
+         "path": "/tasks/{id}",
+         "summary": "Delete a task by ID.",
+         "parameters": [{"name": "id", "in": "path", "type": "integer", "required": True}],
+     },
+     {
+         "method": "GET",
+         "path": "/users",
+         "summary": "List all users.",
+     },
+     {
+         "method": "POST",
+         "path": "/users",
+         "summary": "Create a new user. Requires username, email, password.",
+         "request_body": {
+             "required": ["username", "email", "password"],
+             "properties": {
+                 "username": {"type": "string"},
+                 "email": {"type": "string", "format": "email"},
+                 "password": {"type": "string"},
+                 "role": {"type": "string", "enum": ["user", "admin"]},
+             },
+         },
+     },
+     {
+         "method": "GET",
+         "path": "/users/{id}",
+         "summary": "Get a specific user by ID.",
+         "parameters": [{"name": "id", "in": "path", "type": "integer", "required": True}],
+     },
+     {
+         "method": "POST",
+         "path": "/auth/login",
+         "summary": "Login and receive an auth token. Pre-seeded users: alice, bob, charlie (password: any string).",
+         "request_body": {
+             "required": ["username", "password"],
+             "properties": {
+                 "username": {"type": "string"},
+                 "password": {"type": "string"},
+             },
+         },
+     },
+ ]
+
+
+ class APITestEnvironment(Environment):
+     """OpenEnv environment for API integration testing.
+
+     The agent tests a deliberately buggy REST API by sending HTTP requests
+     and analyzing responses. It earns rewards for coverage, finding bugs,
+     and exploring edge cases.
+     """
+
+     SUPPORTS_CONCURRENT_SESSIONS = False
+
+     def __init__(self, **kwargs):
+         super().__init__(**kwargs)
+         self._db: Optional[Database] = None
+         self._api: Optional[TestClient] = None
+         self._bug_detector: Optional[BugDetector] = None
+         self._reward_computer: Optional[RewardComputer] = None
+         self._task: Optional[dict] = None
+         self._found_bugs: set[str] = set()
+         self._steps_taken: int = 0
+         self._cumulative_reward: float = 0.0
+         self._action_history: list[dict] = []
+         self._auth_tokens: dict[str, str] = {}
+         self._episode_id: str = ""
+         self._seed: Optional[int] = None
+
+     def reset(self, seed=None, episode_id=None, **kwargs) -> APITestObservation:
+         """Reset the environment for a new episode.
+
+         Args:
+             seed: Random seed for domain randomization. When provided, the
+                 database is populated with different users, tasks, and data
+                 so each training episode is unique. None = fixed default data.
+             episode_id: Optional episode identifier for tracking.
+
+         kwargs:
+             task_id: str - one of "basic_validation", "edge_cases", "security_workflows"
+         """
+         task_id = kwargs.get("task_id", "basic_validation")
+         if task_id not in TASKS:
+             task_id = "basic_validation"
+
+         self._task = TASKS[task_id]
+         self._seed = seed
+         self._episode_id = episode_id or f"ep_{int(time.time())}"
+
+         # Reset database with seed for domain randomization
+         # seed=None → fixed data (manual testing / Gradio)
+         # seed=int → randomized data (GRPO training)
+         self._db = Database(seed=seed)
+         buggy_app = create_buggy_api(self._db)
+         self._api = TestClient(buggy_app, raise_server_exceptions=False)
+
+         # Build dynamic task description that includes actual usernames
+         user_names = self._db.user_names
+         user_list = ", ".join(user_names)
+         dynamic_description = (
+             f"{self._task['description']} "
+             f"Users in the system: {user_list} (use any password to login)."
+         )
+
+         # Reset tracking
+         self._bug_detector = BugDetector(task_id)
+         self._reward_computer = RewardComputer()
+         self._found_bugs = set()
+         self._steps_taken = 0
+         self._cumulative_reward = 0.0
+         self._action_history = []
+         self._auth_tokens = {}
+
+         logger.info(f"Reset environment: task={task_id}, seed={seed}, episode={self._episode_id}")
+
+         return APITestObservation(
+             available_endpoints=API_SPEC,
+             status_code=0,
+             response_body=None,
+             response_headers={},
+             response_time_ms=0,
+             feedback=(
+                 f"Environment reset. Task: {dynamic_description} "
+                 f"You have {self._task['max_steps']} steps. Start testing the API!"
+             ),
+             bugs_found_so_far=0,
+             coverage_summary=self._reward_computer.coverage.summary(),
+             known_resource_ids=self._reward_computer.created_ids,
+             auth_tokens=self._auth_tokens,
+             task_id=task_id,
+             task_description=dynamic_description,
+             steps_taken=0,
+             max_steps=self._task["max_steps"],
+             done=False,
+             reward=0.0,
+         )
+
+     def step(self, action: APITestAction, timeout_s=None, **kwargs) -> APITestObservation:
+         """Execute an API test action and return observation + reward."""
+         self._steps_taken += 1
+
+         # Forward request to buggy API
+         method = action.method.value if hasattr(action.method, "value") else str(action.method)
+         endpoint = action.endpoint
+         headers = dict(action.headers) if action.headers else {}
+         query_params = dict(action.query_params) if action.query_params else {}
+         body = action.body
+
+         # Make the request
+         start_time = time.time()
+         try:
+             response = self._api.request(
+                 method=method.upper(),
+                 url=endpoint,
+                 headers=headers,
+                 params=query_params if query_params else None,
+                 json=body,
+             )
+             elapsed_ms = (time.time() - start_time) * 1000
+
+             response_status = response.status_code
+             try:
+                 response_body = response.json()
+             except Exception:
+                 response_body = response.text
+             response_headers = dict(response.headers)
+         except Exception as e:
+             elapsed_ms = (time.time() - start_time) * 1000
+             response_status = 0
+             response_body = {"error": str(e)}
+             response_headers = {}
+
+         # Track auth tokens from login responses
+         if (
+             endpoint == "/auth/login"
+             and response_status == 200
+             and isinstance(response_body, dict)
+             and "token" in response_body
+         ):
+             username = body.get("username", "unknown") if body else "unknown"
+             self._auth_tokens[username] = response_body["token"]
+
+         # Check for bug detection
+         detection = self._bug_detector.check(
+             method=method,
+             endpoint=endpoint,
+             headers=headers,
+             query_params=query_params,
+             body=body,
+             expected_status=action.expected_status,
+             response_status=response_status,
+             response_body=response_body,
+             action_history=self._action_history,
+             found_bugs=self._found_bugs,
+         )
+
+         bug_severity = None
+         bug_id = None
+         if detection:
+             bug_severity = detection.bug.severity
+             bug_id = detection.bug.id
+             self._found_bugs.add(bug_id)
+
+         # Compute reward
+         reward_breakdown = self._reward_computer.compute(
+             method=method,
+             endpoint=endpoint,
+             headers=headers,
+             query_params=query_params,
+             body=body,
+             expected_status=action.expected_status,
+             response_status=response_status,
+             response_body=response_body,
+             bug_found=bug_severity,
+             bug_id=bug_id,
+         )
+         self._cumulative_reward += reward_breakdown.total
+
+         # Record action in history
+         self._action_history.append({
+             "method": method,
+             "endpoint": endpoint,
+             "headers": headers,
+             "query_params": query_params,
+             "body": body,
+             "response_status": response_status,
+             "response_body": response_body,
+         })
+
+         # Generate feedback
+         feedback_parts = [f"{method} {endpoint} -> {response_status}"]
+         if detection:
+             feedback_parts.append(f"BUG FOUND ({detection.bug.severity})! {detection.evidence}")
+         if reward_breakdown.coverage > 0:
+             feedback_parts.append(f"Coverage +{reward_breakdown.coverage:.2f}")
+         if reward_breakdown.penalty < 0:
+             feedback_parts.append("Repeated request penalty")
+
+         done = self._steps_taken >= self._task["max_steps"]
+
+         # Compute final grade if done
+         if done:
+             grade = TaskGrader.grade(
+                 task_id=self._task["id"],
+                 bugs_found=self._found_bugs,
+                 coverage_pct=self._reward_computer.coverage.summary()["coverage_pct"],
+                 endpoints_tested=len(self._reward_computer.coverage.endpoints_hit),
+                 total_endpoints=self._reward_computer.coverage.total_endpoints,
+                 method_endpoint_pairs=len(self._reward_computer.coverage.method_endpoint_pairs),
+                 status_codes_seen=self._reward_computer.coverage.status_codes_seen,
+                 action_history=self._action_history,
+                 created_resources=self._reward_computer.created_ids,
+             )
+             # Generate bug bounty report
+             report = generate_bug_report(list(self._found_bugs), self._action_history)
+
+             feedback_parts.append(
+                 f"\n=== EPISODE COMPLETE ===\n"
+                 f"Final Score: {grade.score:.4f}\n"
+                 f"Bugs Found: {len(self._found_bugs)}/{self._task['total_bugs']}\n"
+                 f"Grade Breakdown: {json.dumps(grade.breakdown, indent=2)}\n"
+                 f"Feedback: {grade.feedback}\n\n"
+                 f"{report}"
+             )
+             # Add grade as bonus on top of step reward (not replacement)
+             final_reward = reward_breakdown.total + grade.score
+         else:
+             final_reward = reward_breakdown.total
+
+         return APITestObservation(
+             available_endpoints=API_SPEC,
+             status_code=response_status,
+             response_body=response_body,
+             response_headers={k: v for k, v in list(response_headers.items())[:20]},
+             response_time_ms=round(elapsed_ms, 2),
+             feedback=" | ".join(feedback_parts),
+             bugs_found_so_far=len(self._found_bugs),
+             coverage_summary=self._reward_computer.coverage.summary(),
+             known_resource_ids=self._reward_computer.created_ids,
+             auth_tokens=self._auth_tokens,
+             task_id=self._task["id"],
+             task_description=self._task["description"],
+             steps_taken=self._steps_taken,
+             max_steps=self._task["max_steps"],
+             done=done,
+             reward=final_reward,
+             metadata={"reward_breakdown": reward_breakdown.as_dict()},
+         )
+
+     @property
+     def state(self) -> APITestState:
+         """Return current episode state."""
+         if not self._task:
+             return APITestState()
+
+         coverage = self._reward_computer.coverage.summary() if self._reward_computer else {}
+         return APITestState(
+             episode_id=self._episode_id,
+             step_count=self._steps_taken,
+             task_id=self._task["id"],
+             task_description=self._task["description"],
+             difficulty=self._task["difficulty"],
+             steps_taken=self._steps_taken,
+             max_steps=self._task["max_steps"],
+             bugs_found=len(self._found_bugs),
+             total_bugs=self._task["total_bugs"],
+             bugs_found_ids=list(self._found_bugs),
+             coverage_pct=coverage.get("coverage_pct", 0.0),
+             endpoints_tested=coverage.get("endpoints_tested", 0),
+             total_endpoints=coverage.get("total_endpoints", 0),
+             current_score=0.0,
+             cumulative_reward=round(self._cumulative_reward, 4),
+         )
server/graders.py ADDED
@@ -0,0 +1,289 @@
+ """
+ Task-specific grading logic and bug bounty report generation.
+
+ Each task has a grader that computes a final score (0.0 - 1.0)
+ based on what the agent accomplished during the episode.
+ """
+
+ from dataclasses import dataclass
+
+
+ @dataclass
+ class GradeResult:
+     score: float
+     breakdown: dict[str, float]
+     feedback: str
+     report: str = ""  # Bug bounty report (markdown)
+
+
+ def generate_bug_report(bugs_found_ids: list[str], action_history: list[dict]) -> str:
+     """Generate a structured bug bounty report for discovered bugs."""
+     from .bug_detector import BugDetector
+
+     detector = BugDetector("security_workflows")
+
+     if not bugs_found_ids:
+         return "## API Security Assessment Report\n\nNo vulnerabilities discovered."
+
+     severity_order = {"hard": 0, "medium": 1, "easy": 2}
+     sorted_bugs = sorted(
+         bugs_found_ids,
+         key=lambda b: severity_order.get(detector.bugs[b].severity, 2) if b in detector.bugs else 2,
+     )
+
+     sections = ["## API Security Assessment Report", ""]
+     sections.append(f"**Vulnerabilities Found:** {len(bugs_found_ids)}")
+
+     # Count by severity
+     counts = {"easy": 0, "medium": 0, "hard": 0}
+     for bid in bugs_found_ids:
+         bug = detector.bugs.get(bid)
+         if bug:
+             counts[bug.severity] = counts.get(bug.severity, 0) + 1
+     sections.append(f"**Critical/Hard:** {counts['hard']} | **Medium:** {counts['medium']} | **Low/Easy:** {counts['easy']}")
+     sections.append("")
+
+     for bid in sorted_bugs:
+         bug = detector.bugs.get(bid)
+         if not bug:
+             continue
+
+         sev_label = {"easy": "LOW", "medium": "MEDIUM", "hard": "HIGH"}.get(bug.severity, "INFO")
+         owasp = bug.owasp if bug.owasp else "Uncategorized"
+
+         sections.append(f"### {sev_label}: {bug.description}")
+         sections.append(f"- **ID:** {bid}")
+         sections.append(f"- **OWASP:** {owasp}")
+         sections.append(f"- **Category:** {bug.category}")
+         if bug.recommendation:
+             sections.append(f"- **Recommendation:** {bug.recommendation}")
+
+         # Best-effort attribution: cite the first recorded request
+         # (per-bug trigger actions aren't tracked in the history)
+         for h in action_history:
+             if h.get("method") and h.get("endpoint"):
+                 sections.append(f"- **Triggered by:** {h['method']} {h['endpoint']}")
+                 break
+         sections.append("")
+
+     return "\n".join(sections)
+
+
+ class TaskGrader:
+     """Computes final scores for each task based on episode performance."""
+
+     @staticmethod
+     def grade(
+         task_id: str,
+         bugs_found: set[str],
+         coverage_pct: float,
+         endpoints_tested: int,
+         total_endpoints: int,
+         method_endpoint_pairs: int,
+         status_codes_seen: set[int],
+         action_history: list[dict],
+         created_resources: dict[str, list],
+     ) -> GradeResult:
+         if task_id == "basic_validation":
+             return TaskGrader._grade_basic(
+                 bugs_found, coverage_pct, endpoints_tested, total_endpoints,
+                 method_endpoint_pairs, status_codes_seen, action_history, created_resources,
+             )
+         elif task_id == "edge_cases":
+             return TaskGrader._grade_edge_cases(
+                 bugs_found, coverage_pct, endpoints_tested, method_endpoint_pairs,
+                 status_codes_seen, action_history, created_resources,
+             )
+         elif task_id == "security_workflows":
+             return TaskGrader._grade_security(
+                 bugs_found, coverage_pct, action_history, created_resources,
+             )
+         return GradeResult(score=0.0, breakdown={}, feedback="Unknown task")
+
+     @staticmethod
+     def _grade_basic(
+         bugs_found, coverage_pct, endpoints_tested, total_endpoints,
+         method_endpoint_pairs, status_codes_seen, action_history, created_resources,
+     ) -> GradeResult:
+         breakdown = {}
+
+         # 0.25: Test all GET endpoints
+         get_endpoints = {
+             h.get("endpoint") for h in action_history
+             if h.get("method", "").upper() == "GET"
+         }
+         get_score = min(len(get_endpoints) / 4, 1.0) * 0.25
+         breakdown["get_coverage"] = round(get_score, 3)
+
+         # 0.20: Test POST with valid data
+         post_success = sum(
+             1 for h in action_history
+             if h.get("method", "").upper() == "POST" and h.get("response_status") == 201
+         )
+         post_score = min(post_success / 2, 1.0) * 0.20
+         breakdown["post_testing"] = round(post_score, 3)
+
+         # 0.15: Test PUT/DELETE
+         put_delete = sum(
+             1 for h in action_history
+             if h.get("method", "").upper() in ("PUT", "DELETE")
+         )
+         pd_score = min(put_delete / 2, 1.0) * 0.15
+         breakdown["put_delete"] = round(pd_score, 3)
+
+         # 0.20: Bug discovery (easy bugs: TASK_01, TASK_02, TASK_03)
+         easy_bugs = {"BUG_TASK_01", "BUG_TASK_02", "BUG_TASK_03"}
+         found_easy = len(bugs_found & easy_bugs)
+         bug_score = min(found_easy / 2, 1.0) * 0.20
+         breakdown["bugs_found"] = round(bug_score, 3)
+
+         # 0.20: Response schema validation (status code variety)
+         schema_score = min(len(status_codes_seen) / 4, 1.0) * 0.20
+         breakdown["schema_validation"] = round(schema_score, 3)
+
+         score = sum(breakdown.values())
+         feedback_parts = []
+         if get_score > 0:
+             feedback_parts.append(f"GET coverage: {len(get_endpoints)} endpoints")
+         if post_success > 0:
+             feedback_parts.append(f"POST success: {post_success}")
+         if found_easy > 0:
+             feedback_parts.append(f"Bugs found: {found_easy}/{len(easy_bugs)}")
+
+         return GradeResult(
+             score=round(min(score, 1.0), 4),
+             breakdown=breakdown,
+             feedback="; ".join(feedback_parts) if feedback_parts else "No significant progress",
+         )
+
+     @staticmethod
+     def _grade_edge_cases(
+         bugs_found, coverage_pct, endpoints_tested, method_endpoint_pairs,
+         status_codes_seen, action_history, created_resources,
+     ) -> GradeResult:
+         breakdown = {}
+
+         # 0.15: Missing required fields testing
+         missing_field_tests = sum(
+             1 for h in action_history
+             if h.get("method", "").upper() == "POST"
+             and isinstance(h.get("body"), dict)
+             and not h["body"].get("title")
+         )
+         breakdown["missing_fields"] = round(min(missing_field_tests / 2, 1.0) * 0.15, 3)
169
+
170
+ # 0.15: Invalid data type testing
171
+ invalid_tests = sum(
172
+ 1 for h in action_history
173
+ if h.get("body") and isinstance(h.get("body"), dict)
174
+ and any(
175
+ isinstance(v, (list, bool)) or v == ""
176
+ for v in h["body"].values()
177
+ )
178
+ )
179
+ breakdown["invalid_types"] = round(min(invalid_tests / 2, 1.0) * 0.15, 3)
180
+
181
+ # 0.15: Boundary value testing (negative pages, huge limits, long strings)
182
+ boundary_tests = 0
183
+ for h in action_history:
184
+ qp = h.get("query_params", {})
185
+ if qp.get("page") is not None and int(str(qp.get("page", 1))) < 1:
186
+ boundary_tests += 1
187
+ if qp.get("limit") is not None and int(str(qp.get("limit", 10))) > 100:
188
+ boundary_tests += 1
189
+ breakdown["boundary_values"] = round(min(boundary_tests / 2, 1.0) * 0.15, 3)
190
+
191
+ # 0.15: Non-existent resource testing
192
+ nonexistent_tests = sum(
193
+ 1 for h in action_history
194
+ if h.get("method", "").upper() in ("GET", "DELETE", "PUT")
195
+ and "/999" in h.get("endpoint", "")
196
+ )
197
+ breakdown["nonexistent_resources"] = round(min(nonexistent_tests / 2, 1.0) * 0.15, 3)
198
+
199
+ # 0.20: Bug discovery (medium bugs)
200
+ medium_bugs = {
201
+ "BUG_TASK_04", "BUG_TASK_05", "BUG_TASK_06",
202
+ "BUG_USER_01", "BUG_USER_02", "BUG_AUTH_02",
203
+ }
204
+ all_relevant = medium_bugs | {"BUG_TASK_01", "BUG_TASK_02", "BUG_TASK_03"}
205
+ found_relevant = len(bugs_found & all_relevant)
206
+ breakdown["bugs_found"] = round(min(found_relevant / 3, 1.0) * 0.20, 3)
207
+
208
+ # 0.20: Dependency chaining (create β†’ read β†’ update β†’ delete)
209
+ chain_score = 0.0
210
+ if any(h.get("method") == "POST" and h.get("response_status") == 201 for h in action_history):
211
+ chain_score += 0.25
212
+ if created_resources.get("tasks"):
213
+ task_ids = created_resources["tasks"]
214
+ for tid in task_ids:
215
+ gets = [h for h in action_history if h.get("endpoint") == f"/tasks/{tid}" and h.get("method") == "GET"]
216
+ puts = [h for h in action_history if h.get("endpoint") == f"/tasks/{tid}" and h.get("method") == "PUT"]
217
+ deletes = [h for h in action_history if h.get("endpoint") == f"/tasks/{tid}" and h.get("method") == "DELETE"]
218
+ if gets:
219
+ chain_score += 0.25
220
+ if puts:
221
+ chain_score += 0.25
222
+ if deletes:
223
+ chain_score += 0.25
224
+ break # Only need one complete chain
225
+ breakdown["dependency_chaining"] = round(min(chain_score, 1.0) * 0.20, 3)
226
+
227
+ score = sum(breakdown.values())
228
+ return GradeResult(
229
+ score=round(min(score, 1.0), 4),
230
+ breakdown=breakdown,
231
+ feedback=f"Edge cases: {found_relevant} bugs found, chain score {chain_score:.0%}",
232
+ )
233
+
234
+ @staticmethod
235
+ def _grade_security(
236
+ bugs_found, coverage_pct, action_history, created_resources,
237
+ ) -> GradeResult:
238
+ breakdown = {}
239
+
240
+ # 0.20: Cross-user authorization testing
241
+ cross_user = False
242
+ login_users = set()
243
+ for h in action_history:
244
+ if h.get("endpoint") == "/auth/login" and h.get("response_status") == 200:
245
+ body = h.get("body", {})
246
+ if body:
247
+ login_users.add(body.get("username"))
248
+ cross_user = len(login_users) >= 2
249
+ breakdown["cross_user_auth"] = 0.20 if cross_user else 0.0
250
+
251
+ # 0.20: Injection pattern testing
252
+ injection_attempted = sum(
253
+ 1 for h in action_history
254
+ if h.get("body") and isinstance(h.get("body"), dict)
255
+ and any(
256
+ pattern.lower() in str(h["body"]).lower()
257
+ for pattern in ["DROP TABLE", "<script>", "OR 1=1", "UNION SELECT", "'; --"]
258
+ )
259
+ )
260
+ breakdown["injection_testing"] = round(min(injection_attempted / 2, 1.0) * 0.20, 3)
261
+
262
+ # 0.20: Multi-step state consistency
263
+ # Check if agent did: create β†’ delete β†’ re-fetch (stale cache test)
264
+ consistency_tests = 0
265
+ for i, h in enumerate(action_history):
266
+ if h.get("method") == "DELETE" and "/tasks/" in h.get("endpoint", ""):
267
+ # Check if agent re-fetched the same resource after deleting
268
+ deleted_endpoint = h["endpoint"]
269
+ for j in range(i + 1, len(action_history)):
270
+ if action_history[j].get("endpoint") == deleted_endpoint and action_history[j].get("method") == "GET":
271
+ consistency_tests += 1
272
+ break
273
+ breakdown["state_consistency"] = round(min(consistency_tests, 1.0) * 0.20, 3)
274
+
275
+ # 0.20: Security bug discovery
276
+ security_bugs = {"BUG_TASK_07", "BUG_AUTH_01", "BUG_TASK_08", "BUG_TASK_09"}
277
+ found_security = len(bugs_found & security_bugs)
278
+ breakdown["security_bugs"] = round(min(found_security / 2, 1.0) * 0.20, 3)
279
+
280
+ # 0.20: Complete workflow coverage
281
+ workflow_coverage = min(coverage_pct / 80, 1.0) # 80% coverage = full score
282
+ breakdown["workflow_coverage"] = round(workflow_coverage * 0.20, 3)
283
+
284
+ score = sum(breakdown.values())
285
+ return GradeResult(
286
+ score=round(min(score, 1.0), 4),
287
+ breakdown=breakdown,
288
+ feedback=f"Security: {found_security} security bugs, {len(login_users)} users tested, {injection_attempted} injection attempts",
289
+ )
server/reward.py ADDED
@@ -0,0 +1,238 @@
+ """
+ Multi-signal reward function for the API Testing Environment.
+
+ Rewards are decomposed into:
+ 1. Coverage reward β€” exploring new endpoints/methods/status codes
+ 2. Validity reward β€” well-formed requests and proper dependency chaining
+ 3. Bug discovery reward β€” the core goal, scaled by severity
+ 4. Exploration bonus β€” trying novel actions
+ 5. Penalties β€” for repeating exact requests or malformed input
+ """
+
+ from dataclasses import dataclass, field
+ from typing import Any, Optional
+ import re
+
+
+ @dataclass
+ class CoverageTracker:
+     """Tracks API coverage across the episode."""
+
+     endpoints_hit: set[str] = field(default_factory=set)
+     method_endpoint_pairs: set[tuple[str, str]] = field(default_factory=set)
+     status_codes_seen: set[int] = field(default_factory=set)
+     total_endpoints: int = 10  # known endpoint patterns
+
+     def record(self, method: str, endpoint: str, status_code: int) -> dict[str, bool]:
+         """Record a request and return what's new."""
+         normalized_endpoint = self._normalize_endpoint(endpoint)
+         pair = (method.upper(), normalized_endpoint)
+
+         is_new_endpoint = normalized_endpoint not in self.endpoints_hit
+         is_new_pair = pair not in self.method_endpoint_pairs
+         is_new_status = status_code not in self.status_codes_seen
+
+         self.endpoints_hit.add(normalized_endpoint)
+         self.method_endpoint_pairs.add(pair)
+         self.status_codes_seen.add(status_code)
+
+         return {
+             "new_endpoint": is_new_endpoint,
+             "new_method_endpoint": is_new_pair,
+             "new_status_code": is_new_status,
+         }
+
+     def _normalize_endpoint(self, endpoint: str) -> str:
+         """Normalize /tasks/42 to /tasks/{id}."""
+         normalized = re.sub(r"/(\d+)", "/{id}", endpoint)
+         return normalized.rstrip("/") or "/"
+
+     def summary(self) -> dict:
+         return {
+             "endpoints_tested": len(self.endpoints_hit),
+             "total_endpoints": self.total_endpoints,
+             "method_endpoint_pairs": len(self.method_endpoint_pairs),
+             "status_codes_seen": sorted(self.status_codes_seen),
+             "coverage_pct": round(len(self.endpoints_hit) / max(self.total_endpoints, 1) * 100, 1),
+         }
+
+
+ @dataclass
+ class RewardBreakdown:
+     coverage: float = 0.0
+     validity: float = 0.0
+     bug_discovery: float = 0.0
+     exploration: float = 0.0
+     penalty: float = 0.0
+     total: float = 0.0
+
+     def as_dict(self) -> dict:
+         return {
+             "coverage": round(self.coverage, 4),
+             "validity": round(self.validity, 4),
+             "bug_discovery": round(self.bug_discovery, 4),
+             "exploration": round(self.exploration, 4),
+             "penalty": round(self.penalty, 4),
+             "total": round(self.total, 4),
+         }
+
+
+ class RewardComputer:
+     """Computes multi-signal rewards for API testing actions."""
+
+     def __init__(self):
+         self.coverage = CoverageTracker()
+         self.action_history: list[dict] = []
+         self.found_bugs: set[str] = set()
+         self.created_ids: dict[str, list[Any]] = {}  # resource type -> list of IDs
+
+     def reset(self):
+         self.coverage = CoverageTracker()
+         self.action_history = []
+         self.found_bugs = set()
+         self.created_ids = {}
+
+     def compute(
+         self,
+         method: str,
+         endpoint: str,
+         headers: dict,
+         query_params: dict,
+         body: Optional[dict],
+         expected_status: Optional[int],
+         response_status: int,
+         response_body: Any,
+         bug_found: Optional[str] = None,  # bug severity if found
+         bug_id: Optional[str] = None,
+     ) -> RewardBreakdown:
+         """Compute reward for this step."""
+         breakdown = RewardBreakdown()
+
+         # 1. Coverage reward (0.0 - 0.3)
+         coverage_info = self.coverage.record(method, endpoint, response_status)
+         if coverage_info["new_endpoint"]:
+             breakdown.coverage += 0.10
+         if coverage_info["new_method_endpoint"]:
+             breakdown.coverage += 0.05
+         if coverage_info["new_status_code"]:
+             breakdown.coverage += 0.05
+
+         # 2. Validity reward (0.0 - 0.2)
+         if response_status < 500:
+             breakdown.validity += 0.03  # Non-crash request
+
+         if self._used_dependency(method, endpoint, body, headers):
+             breakdown.validity += 0.10  # Used a previously created resource ID or auth token
+
+         if expected_status is not None and expected_status == response_status:
+             breakdown.validity += 0.05  # Correctly predicted status code
+
+         # Track created resources
+         self._track_created_resources(method, endpoint, response_status, response_body)
+
+         # 3. Bug discovery reward (0.0 - 0.4)
+         if bug_found and bug_id:
+             if bug_id not in self.found_bugs:
+                 self.found_bugs.add(bug_id)
+                 if bug_found == "easy":
+                     breakdown.bug_discovery += 0.10
+                 elif bug_found == "medium":
+                     breakdown.bug_discovery += 0.15
+                 elif bug_found == "hard":
+                     breakdown.bug_discovery += 0.25
+                 # First discovery bonus
+                 breakdown.bug_discovery += 0.05
+
+         # 4. Exploration bonus (0.0 - 0.1)
+         action_sig = self._action_signature(method, endpoint, query_params, body)
+         is_novel = all(
+             self._action_signature(
+                 h.get("method", ""),
+                 h.get("endpoint", ""),
+                 h.get("query_params", {}),
+                 h.get("body"),
+             )
+             != action_sig
+             for h in self.action_history
+         )
+         if is_novel:
+             breakdown.exploration += 0.05
+
+         # 5. Penalties
+         # Exact duplicate request
+         exact_match = any(
+             h.get("method") == method
+             and h.get("endpoint") == endpoint
+             and h.get("query_params") == query_params
+             and h.get("body") == body
+             and h.get("headers") == headers
+             for h in self.action_history
+         )
+         if exact_match:
+             breakdown.penalty -= 0.08
+
+         # Record this action in history
+         self.action_history.append({
+             "method": method,
+             "endpoint": endpoint,
+             "headers": headers,
+             "query_params": query_params,
+             "body": body,
+             "response_status": response_status,
+             "response_body": response_body,
+         })
+
+         # Total
+         breakdown.total = max(
+             breakdown.coverage + breakdown.validity + breakdown.bug_discovery + breakdown.exploration + breakdown.penalty,
+             -0.1,  # Floor to prevent extreme negative rewards
+         )
+         breakdown.total = min(breakdown.total, 1.0)
+
+         return breakdown
+
+     def _used_dependency(self, method: str, endpoint: str, body: Optional[dict], headers: dict) -> bool:
+         """Check if this request uses a resource ID or token from a previous step."""
+         endpoint_str = str(endpoint)
+
+         # Check if endpoint contains a known resource ID
+         for resource_type, ids in self.created_ids.items():
+             for rid in ids:
+                 if str(rid) in endpoint_str:
+                     return True
+
+         # Check if using an auth token obtained from login
+         if headers.get("Authorization"):
+             for prev in self.action_history:
+                 if (
+                     prev.get("endpoint") == "/auth/login"
+                     and prev.get("response_status") == 200
+                     and isinstance(prev.get("response_body"), dict)
+                     and "token" in prev["response_body"]
+                 ):
+                     token = prev["response_body"]["token"]
+                     if token in headers["Authorization"]:
+                         return True
+         return False
+
+     def _track_created_resources(
+         self, method: str, endpoint: str, status: int, body: Any
+     ):
+         """Track resource IDs from POST responses."""
+         if method.upper() == "POST" and status == 201 and isinstance(body, dict):
+             resource_id = body.get("id")
+             if resource_id is not None:
+                 # Determine resource type from endpoint
+                 resource_type = endpoint.strip("/").split("/")[0]
+                 if resource_type not in self.created_ids:
+                     self.created_ids[resource_type] = []
+                 self.created_ids[resource_type].append(resource_id)
+
+     def _action_signature(
+         self, method: str, endpoint: str, query_params: dict, body: Optional[dict]
+     ) -> str:
+         """Create a signature for an action to check novelty."""
+         normalized = re.sub(r"/\d+", "/{id}", endpoint)
+         body_keys = sorted(body.keys()) if body else []
+         param_keys = sorted(query_params.keys()) if query_params else []
+         return f"{method}:{normalized}:{param_keys}:{body_keys}"
setup.sh ADDED
@@ -0,0 +1,158 @@
+ #!/bin/bash
+ # ============================================================
+ # API Testing Environment β€” One-command setup
+ # ============================================================
+ # Usage: bash setup.sh
+ #
+ # This script:
+ #   1. Creates a virtual environment
+ #   2. Detects your GPU and installs the correct PyTorch+CUDA
+ #   3. Installs all project dependencies
+ #   4. Verifies everything works
+ # ============================================================
+
+ set -e
+
+ echo ""
+ echo "============================================"
+ echo "  API Testing Environment β€” Setup"
+ echo "============================================"
+ echo ""
+
+ # --- Step 1: Create venv ---
+ echo "[1/5] Setting up virtual environment..."
+ if [ ! -d ".venv" ]; then
+     python3 -m venv .venv
+     echo "  Created .venv"
+ else
+     echo "  .venv already exists"
+ fi
+ source .venv/bin/activate
+ pip install --upgrade pip setuptools wheel -q
+ echo "  Python: $(python3 --version)"
+ echo "  pip:    $(pip --version | awk '{print $2}')"
+ echo ""
+
+ # --- Step 2: Install PyTorch with correct CUDA ---
+ echo "[2/5] Detecting GPU and installing PyTorch..."
+
+ install_pytorch() {
+     if command -v nvidia-smi &> /dev/null; then
+         DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -1)
+         DRIVER_MAJOR=$(echo "$DRIVER_VERSION" | cut -d. -f1)
+         GPU_NAME=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1)
+         GPU_MEM=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader 2>/dev/null | head -1)
+
+         echo "  GPU: $GPU_NAME ($GPU_MEM)"
+         echo "  NVIDIA driver: $DRIVER_VERSION"
+
+         if [ "$DRIVER_MAJOR" -ge 530 ]; then
+             echo "  -> Installing PyTorch + CUDA 12.1"
+             pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121 -q
+         elif [ "$DRIVER_MAJOR" -ge 450 ]; then
+             echo "  -> Installing PyTorch + CUDA 11.8 (older driver)"
+             pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118 -q
+         else
+             echo "  WARNING: Driver too old ($DRIVER_VERSION). Installing CPU-only PyTorch."
+             echo "  Upgrade: https://www.nvidia.com/Download/index.aspx"
+             pip install torch torchvision -q
+         fi
+     else
+         echo "  No NVIDIA GPU detected."
+         # Check for Apple Silicon
+         if python3 -c "import platform; exit(0 if platform.processor() == 'arm' else 1)" 2>/dev/null; then
+             echo "  -> Apple Silicon detected, installing default PyTorch (MPS support)"
+         else
+             echo "  -> Installing CPU-only PyTorch"
+         fi
+         pip install torch torchvision -q
+     fi
+ }
+
+ install_pytorch
+ echo ""
+
+ # --- Step 3: Install project dependencies ---
+ echo "[3/5] Installing project dependencies..."
+ pip install -r requirements.txt -q
+ echo "  Done."
+ echo ""
+
+ # --- Step 4: Verify everything ---
+ echo "[4/5] Verifying installation..."
+ echo ""
+ python3 << 'PYEOF'
+ import sys
+
+ # Core
+ import fastapi, uvicorn, pydantic, httpx
+ print(f"  fastapi: {fastapi.__version__}")
+
+ # ML
+ import torch
+ print(f"  torch: {torch.__version__}")
+ cuda = torch.cuda.is_available()
+ mps = hasattr(torch.backends, 'mps') and torch.backends.mps.is_available()
+ if cuda:
+     print(f"  CUDA: {torch.version.cuda}")
+     print(f"  GPU: {torch.cuda.get_device_name(0)}")
+     print(f"  GPU memory: {torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")
+ elif mps:
+     print("  Device: Apple MPS")
+ else:
+     print("  Device: CPU only (training will be slow!)")
+
+ import transformers, trl, peft, datasets
+ print(f"  transformers: {transformers.__version__}")
+ print(f"  trl: {trl.__version__}")
+ print(f"  peft: {peft.__version__}")
+
+ # Optional
+ try:
+     import wandb
+     print(f"  wandb: {wandb.__version__}")
+ except ImportError:
+     print("  wandb: not installed (optional)")
+
+ try:
+     import gradio
+     print(f"  gradio: {gradio.__version__}")
+ except ImportError:
+     print("  gradio: not installed (optional)")
+
+ # OpenEnv
+ try:
+     import openenv
+     print("  openenv: OK")
+ except ImportError:
+     print("  openenv: MISSING β€” run: pip install -r requirements.txt")
+
+ # Environment test
+ print("")
+ sys.path.insert(0, ".")
+ from server.environment import APITestEnvironment
+ from models import APITestAction, HTTPMethod
+ env = APITestEnvironment()
+ obs = env.reset(seed=42, task_id="basic_validation")
+ obs = env.step(APITestAction(method=HTTPMethod.GET, endpoint="/tasks/999999", expected_status=404))
+ assert obs.bugs_found_so_far == 1, "Bug detection failed!"
+ print("  Environment: OK (bug detection verified)")
+ PYEOF
+
+ echo ""
+
+ # --- Step 5: Done ---
+ echo "============================================"
+ echo "  Setup complete!"
+ echo "============================================"
+ echo ""
+ echo "  Activate:   source .venv/bin/activate"
+ echo ""
+ echo "  Gradio UI:  python gradio_app.py"
+ echo "  Baselines:  python -m training.evaluate --task all --agent all"
+ echo "  Training:   python -m training.grpo --model-id Qwen/Qwen3-1.7B"
+ echo "  Test mode:  python -m training.grpo --test-mode"
+ echo ""
+ echo "  For HF Hub: huggingface-cli login"
+ echo "  For W&B:    wandb login"
+ echo ""
train_grpo.py ADDED
@@ -0,0 +1,6 @@
+ #!/usr/bin/env python3
+ """GRPO training β€” see training/grpo.py for the full implementation."""
+ from training.grpo import main
+
+ if __name__ == "__main__":
+     main()
training/README.md ADDED
@@ -0,0 +1,392 @@
+ # Training Module
+
+ Everything related to training an AI agent to test APIs using GRPO (Group Relative Policy Optimization).
+
+ ---
+
+ ## Setup
+
+ ```bash
+ cd api_testing_env
+
+ # Option 1: Automated setup (creates venv, installs everything)
+ bash setup.sh
+
+ # Option 2: Manual setup
+ python3 -m venv .venv
+ source .venv/bin/activate
+ pip install -r requirements.txt
+
+ # Optional: login to HuggingFace Hub (for model push)
+ huggingface-cli login
+
+ # Optional: login to Weights & Biases (for logging)
+ wandb login
+ ```
+
+ ### Environment Variables
+
+ Create a `.env` file in `api_testing_env/` (or export in your shell):
+
+ ```bash
+ # .env
+
+ # HuggingFace Hub β€” required for --push-to-hub
+ # Get your token at: https://huggingface.co/settings/tokens
+ HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+
+ # Weights & Biases β€” required for --use-wandb
+ # Get your key at: https://wandb.ai/authorize
+ WANDB_API_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
+
+ # Optional: set W&B defaults
+ WANDB_PROJECT=api-testing-grpo
+ WANDB_ENTITY=your-team-name
+ ```
+
+ **Three ways to provide these keys:**
+
+ | Method | Command |
+ |--------|---------|
+ | `.env` file | Create `.env` as shown above, then `source .env` before training |
+ | CLI login | `huggingface-cli login` and `wandb login` (stores keys in ~/.cache) |
+ | Inline export | `export HF_TOKEN=hf_xxx && export WANDB_API_KEY=xxx` |
+
+ > **Important:** Never commit `.env` to git. It's already in `.gitignore`.
+
+ ---
+
+ ## Quick Start
+
+ ```bash
+ cd api_testing_env
+ source .venv/bin/activate
+
+ # 1. See what training prompts look like (no GPU needed)
+ SHOW_PROMPTS=1 python -m training.grpo
+
+ # 2. Quick sanity check (CPU, ~2 minutes)
+ python -m training.grpo --test-mode
+
+ # 3. Real training (GPU required)
+ python -m training.grpo --model-id Qwen/Qwen3-1.7B --num-episodes 100
+
+ # 4. With HuggingFace Hub push
+ python -m training.grpo \
+     --push-to-hub --hf-repo-id your-username/api-tester-grpo
+
+ # 5. With Weights & Biases logging
+ python -m training.grpo \
+     --use-wandb --wandb-project api-testing-grpo
+
+ # 6. Full pipeline: training + HF push + W&B
+ python -m training.grpo \
+     --model-id Qwen/Qwen3-1.7B \
+     --num-episodes 100 \
+     --push-to-hub --hf-repo-id your-username/api-tester-grpo \
+     --use-wandb --wandb-project api-testing-grpo
+
+ # 7. Run baseline agents only (no GPU needed)
+ python -m training.evaluate --task all --agent all --url http://localhost:8000
+
+ # 8. Resume from checkpoint
+ python -m training.grpo --model-id ./checkpoints/step_50
+ ```
+
+ ---
+
+ ## How Training Works
+
+ There is **no external dataset**. The environment generates unique episodes on the fly.
+
+ ```
+                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
+                 β”‚           GRPO Training Loop                  β”‚
+                 β”‚                                               β”‚
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚ 1. env.reset(seed=N)                          β”‚
+ β”‚           β”‚   β”‚    β†’ unique users, tasks, data                β”‚
+ β”‚   Qwen    β”‚   β”‚                                               β”‚
+ β”‚   1.7B    │──▢│ 2. LLM generates: {"method":"GET",...}        β”‚
+ β”‚  + LoRA   β”‚   β”‚                                               β”‚
+ β”‚           │◀──│ 3. env.step(action) β†’ reward                  β”‚
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚    coverage + bugs + validity                 β”‚
+                 β”‚                                               β”‚
+                 β”‚ 4. GRPO: generate 4 attempts per prompt,      β”‚
+                 β”‚    keep best, update model weights            β”‚
+                 β”‚                                               β”‚
+                 β”‚ 5. Repeat with next seed                      β”‚
+                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
+ ```
+
+ ### Why no dataset file?
+
+ Each `reset(seed=N)` creates a **unique database** with different users, tasks, and data:
+
+ | Seed | Users | Tasks |
+ |------|-------|-------|
+ | 42 | diana, alice, xander, ivan, hannah | 8 tasks |
+ | 99 | mike, george, tom, fiona | 6 tasks |
+ | 7 | priya, kevin, wendy | 4 tasks |
+
+ The agent can't memorize "login as alice" because alice might not exist. It must **read the observation and adapt** β€” that's the learning signal.
+
+ The bugs (13 planted flaws) are structural β€” same code flaws every episode β€” but the path to finding them changes because the data is different.
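+ The effect of seeded generation can be illustrated with a toy sketch (this is not the environment's actual generator; the name pool and `sample_users` helper are hypothetical):
+
+ ```python
+ import random
+
+ POOL = ("alice", "bob", "diana", "ivan", "mike", "priya")
+
+ def sample_users(seed: int) -> list[str]:
+     # Illustrative only: deterministic per-seed sampling, the same idea
+     # the environment uses to generate users and tasks on reset(seed=N).
+     rng = random.Random(seed)
+     count = rng.randint(3, 5)       # episode size varies with the seed
+     return rng.sample(POOL, count)  # which users exist varies too
+
+ # The same seed always reproduces the same episode data,
+ # so training prompts are diverse yet fully reproducible.
+ assert sample_users(42) == sample_users(42)
+ ```
+ 
+ 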
+
+ ---
+
+ ## Training Pipeline
+
+ The full training pipeline runs these steps automatically:
+
+ ```
+ 1. Run baseline agents (random, sequential, smart) across all tasks
+    ↓
+ 2. Load base model (Qwen 1.7B)
+    ↓
+ 3. Evaluate base model before training (establishes LLM baseline)
+    ↓
+ 4. GRPO training with LoRA
+    ↓
+ 5. Save model locally to --output-dir
+    ↓
+ 6. Push to HuggingFace Hub (if --push-to-hub)
+    ↓
+ 7. Evaluate trained model after GRPO
+    ↓
+ 8. Print comparison table (baselines vs base vs trained)
+    ↓
+ 9. Save metrics (JSON + markdown) to output-dir/metrics/
+    ↓
+ 10. Save comparison plots (PNG) to output-dir/metrics/plots/
+    ↓
+ 11. Finalize W&B run (if --use-wandb)
+ ```
+
+ ---
+
+ ## File Guide
+
+ | File | Purpose | When to modify |
+ |------|---------|----------------|
+ | `prompts.py` | System prompt, `format_observation()`, `parse_action()` | Change how the LLM sees tasks or formats actions |
+ | `rewards.py` | `format_reward_fn()`, `environment_reward_fn()` | Tune reward scaling or add new reward signals |
+ | `agents.py` | `RandomAgent`, `SequentialAgent`, `SmartAgent` | Add new baseline strategies |
+ | `grpo.py` | `build_training_prompts()`, `train_grpo()` | Change training hyperparameters or model |
+ | `evaluate.py` | `run_rollout()`, `run_baseline_local()`, remote runner | Change evaluation logic |
+
+ ### prompts.py
+
+ The bridge between the environment and the LLM.
+
+ **`SYSTEM_PROMPT`** β€” Instructions telling the LLM it's an API tester. Includes output format (JSON) and testing strategies.
+
+ **`format_observation(obs)`** β€” Converts an environment observation into text:
+ - First turn: full API spec + task description + available users
+ - Later turns: last response + feedback + progress stats + auth tokens
+
+ **`parse_action(text)`** β€” Extracts JSON from LLM output. Handles:
+ - Raw JSON: `{"method": "GET", "endpoint": "/tasks"}`
+ - Code blocks: `` ```json {...} ``` ``
+ - Extra text around JSON: `"I'll try: {...}"`
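+ A minimal sketch of this kind of extraction (not the project's actual `parse_action`; the `extract_action` name and regexes here are illustrative):
+
+ ```python
+ import json
+ import re
+ from typing import Optional
+
+ def extract_action(text: str) -> Optional[dict]:
+     """Find the first JSON object in an LLM completion and decode it."""
+     # Prefer a fenced ```json ... ``` block if one is present.
+     fenced = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
+     candidate = fenced.group(1) if fenced else None
+     if candidate is None:
+         # Fall back to the widest {...} span in the raw text,
+         # which also covers JSON surrounded by prose.
+         match = re.search(r"\{.*\}", text, re.DOTALL)
+         candidate = match.group(0) if match else None
+     if candidate is None:
+         return None
+     try:
+         return json.loads(candidate)
+     except json.JSONDecodeError:
+         return None
+
+ # Handles raw JSON, fenced JSON, and JSON embedded in prose.
+ print(extract_action('I\'ll try: {"method": "GET", "endpoint": "/tasks"}'))
+ ```
+ 
+ 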
+
+ ### rewards.py
+
+ Two reward functions that GRPO uses to score each LLM completion:
+
+ **`format_reward_fn`** β€” Binary: +1.0 if valid JSON action, -1.0 if not. Teaches the model to always output parseable actions.
+
+ **`environment_reward_fn`** β€” Runs the action in the environment and returns the actual reward (coverage + bugs + validity), scaled by 5.0 to dominate over format reward.
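+ Since GRPO sums its reward functions, the 5.0 scale is what lets environment quality dominate once the model reliably emits valid JSON. A simplified sketch of how the two signals combine (the `combined_reward` helper is illustrative, not the project's code):
+
+ ```python
+ def combined_reward(parsed_ok: bool, env_reward: float, env_scale: float = 5.0) -> float:
+     """Sum of the two signals for one completion (simplified)."""
+     format_reward = 1.0 if parsed_ok else -1.0
+     # An unparseable completion never reaches the environment.
+     environment_reward = env_scale * env_reward if parsed_ok else 0.0
+     return format_reward + environment_reward
+
+ print(combined_reward(True, 0.4))   # valid action, decent env reward -> 3.0
+ print(combined_reward(False, 0.4))  # invalid JSON -> -1.0
+ ```
+ 
+ 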
+
200
+ ### agents.py
201
+
202
+ Three hand-coded baselines for comparison:
203
+
204
+ | Agent | Strategy | Expected Score |
205
+ |-------|----------|---------------|
206
+ | `RandomAgent` | Random method + random endpoint | ~0.10 |
207
+ | `SequentialAgent` | Fixed sequence: GET, POST, PUT, DELETE each endpoint | ~0.35 |
208
+ | `SmartAgent` | Multi-phase: discover β†’ auth β†’ CRUD β†’ bug hunt β†’ security | ~0.55 |
209
+
210
+ A GRPO-trained model should beat the SmartAgent.
211
+
212
+ ### grpo.py
213
+
214
+ The main training script.
215
+
216
+ **`build_training_prompts(num_episodes)`** β€” Creates N prompts by resetting the environment with seeds 0..N. Each prompt is a chat message with system prompt + initial observation.
217
+
218
+ **`run_baseline_evaluation(seed)`** β€” Runs all three baseline agents across all tasks before training starts.
219
+
220
+ **`train_grpo(args)`** β€” Full GRPO loop:
221
+ 1. Run baseline agents for comparison
222
+ 2. Load model + tokenizer (Qwen 1.7B default)
223
+ 3. Evaluate base model before training
224
+ 4. Apply LoRA (r=16, alpha=32, targets q_proj + v_proj)
225
+ 5. Generate prompts from environment
226
+ 6. Create per-prompt environment instances for reward eval
227
+ 7. Train with TRL's GRPOTrainer
228
+ 8. Save model locally + push to HF Hub
229
+ 9. Evaluate trained model + print comparison
230
+ 10. Save metrics (JSON, markdown) and plots (PNG)
231
+ 11. Finalize W&B run
232
+
233
+ **`save_metrics()`** β€” Saves `results.json` and `results.md` to `output-dir/metrics/`.
234
+
235
+ **`save_plots()`** β€” Generates three comparison bar charts (reward, bugs, coverage) saved as PNGs.
236
+
237
+ ### evaluate.py
238
+
239
+ **`run_rollout(model, tokenizer, task_id, seed)`** β€” Runs one full episode with a HuggingFace model. Multi-turn: LLM generates action β†’ env steps β†’ LLM sees result β†’ repeats.
240
+
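The plan the model is asked to produce is a JSON array of actions mirroring the `APITestAction` fields (`method`, `endpoint`, `body`, `expected_status`). A hand-written example of what a parsed plan looks like (the concrete values are illustrative):

```python
import json

plan_text = """[
  {"method": "GET", "endpoint": "/tasks", "expected_status": 200},
  {"method": "POST", "endpoint": "/auth/login",
   "body": {"username": "alice", "password": "password123"},
   "expected_status": 200}
]"""

# parse_test_plan() does more validation; json.loads shows the basic shape
plan = json.loads(plan_text)
```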
241
+ **`run_baseline_local(agent_name, task_id, seed)`** — Runs baseline agents against the local environment (no server needed). Used by `grpo.py` to establish baselines before training.
242
+
243
+ **`run_episode(url, task_id, agent_cls)`** — Runs a baseline agent against a remote server via WebSocket.
244
+
245
+ ---
246
+
247
+ ## Training Hyperparameters
248
+
249
+ | Parameter | Default | Description |
250
+ |-----------|---------|-------------|
251
+ | `--model-id` | `Qwen/Qwen3-1.7B` | Base model (any HF causal LM) |
252
+ | `--num-episodes` | 50 | Training prompts (more = more diverse episodes) |
253
+ | `--num-generations` | 4 | GRPO rollouts per prompt (higher = better but slower) |
254
+ | `--max-completion-length` | 256 | Max tokens per LLM response |
255
+ | `--max-steps` | 200 | Total training optimizer steps |
256
+ | `--learning-rate` | 2e-5 | AdamW learning rate |
257
+ | `--batch-size` | 1 | Per-device batch size |
258
+ | `--output-dir` | `./checkpoints/grpo_api_tester` | Where to save model |
259
+ | `--push-to-hub` | off | Push trained model to HuggingFace Hub |
260
+ | `--hf-repo-id` | none | HF Hub repo (e.g., `user/api-tester-grpo`) |
261
+ | `--use-wandb` | off | Enable Weights & Biases logging |
262
+ | `--wandb-project` | `api-testing-grpo` | W&B project name |
263
+ | `--wandb-run-name` | auto | W&B run name |
264
+ | `--test-mode` | off | Quick 3-episode, 2-gen, 5-step test |
265
+
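These flags map onto a standard `argparse` parser. A hypothetical subset mirroring the defaults above (not the full CLI):

```python
import argparse

# Sketch of a few of the flags listed in the table; defaults match the table.
parser = argparse.ArgumentParser(description="GRPO training for the API tester")
parser.add_argument("--model-id", default="Qwen/Qwen3-1.7B")
parser.add_argument("--num-episodes", type=int, default=50)
parser.add_argument("--num-generations", type=int, default=4)
parser.add_argument("--learning-rate", type=float, default=2e-5)
parser.add_argument("--test-mode", action="store_true")

# Quick smoke-test invocation
args = parser.parse_args(["--num-episodes", "3", "--test-mode"])
```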
266
+ ### Hardware Requirements
267
+
268
+ | Setup | GPU | Time | Model |
269
+ |-------|-----|------|-------|
270
+ | Colab Free | T4 (16GB) | ~1-2 hours | Qwen 1.7B + 4-bit LoRA |
271
+ | Colab Pro | A100 (40GB) | ~30 min | Qwen 4B + LoRA |
272
+ | Local | Any 8GB+ | ~1-2 hours | Qwen 1.7B + 4-bit LoRA |
273
+ | CPU only | None | `--test-mode` only | Verifies pipeline works |
274
+
275
+ ---
276
+
277
+ ## Output Structure
278
+
279
+ After training, your output directory will look like:
280
+
281
+ ```
282
+ checkpoints/grpo_api_tester/
283
+ ├── adapter_config.json              # LoRA adapter config
284
+ ├── adapter_model.safetensors        # Trained LoRA weights
285
+ ├── tokenizer.json                   # Tokenizer files
286
+ ├── tokenizer_config.json
287
+ ├── special_tokens_map.json
288
+ └── metrics/
289
+     ├── results.json                 # Full results (baselines + base + trained)
290
+     ├── results.md                   # Markdown comparison table
291
+     └── plots/
292
+         ├── reward_comparison.png    # Bar chart: reward across all agents
293
+         ├── bugs_comparison.png      # Bar chart: bugs found
294
+         └── coverage_comparison.png  # Bar chart: API coverage %
295
+ ```
296
+
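Reading the summary back is a one-liner with the standard library (`load_results` is a hypothetical helper following the layout above):

```python
import json
from pathlib import Path

def load_results(output_dir: str) -> dict:
    """Load the metrics summary written to <output_dir>/metrics/results.json."""
    return json.loads((Path(output_dir) / "metrics" / "results.json").read_text())
```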
297
+ ---
298
+
299
+ ## Weights & Biases Integration
300
+
301
+ When `--use-wandb` is enabled, the following is logged:
302
+
303
+ | Metric | Description |
304
+ |--------|-------------|
305
+ | `baseline/{agent}/{task}/reward` | Baseline agent scores |
306
+ | `base_model/{task}/reward` | Pre-training model scores |
307
+ | `trained_model/{task}/reward` | Post-training model scores |
308
+ | `delta/{task}/reward` | Improvement over base model |
309
+ | `plots/*` | Comparison charts as W&B images |
310
+ | TRL defaults | Loss, learning rate, reward mean/std |
311
+
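The `delta/{task}/reward` rows are just per-task differences between the trained and base model scores. A hypothetical helper showing the computation:

```python
def delta_rewards(base: dict[str, float], trained: dict[str, float]) -> dict[str, float]:
    """Improvement of the trained model over the base model, per task."""
    return {task: round(trained[task] - base[task], 4) for task in base}
```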
312
+ ---
313
+
314
+ ## Expected Results
315
+
316
+ ### Before Training (base Qwen 1.7B, no fine-tuning)
317
+
318
+ The base model sometimes outputs valid JSON, but it has no API testing strategy:
319
+ ```
320
+ basic_validation: ~0.15 (random-level)
321
+ edge_cases: ~0.08
322
+ security_workflows: ~0.03
323
+ ```
324
+
325
+ ### After GRPO (50 episodes, 200 steps)
326
+
327
+ The model learns systematic testing patterns:
328
+ ```
329
+ basic_validation: ~0.55-0.65
330
+ edge_cases: ~0.35-0.45
331
+ security_workflows: ~0.25-0.35
332
+ ```
333
+
334
+ ### What the Model Learns
335
+
336
+ 1. **Output format** — Always produce valid JSON (format reward)
337
+ 2. **Coverage** — Test different endpoints, don't repeat the same request
338
+ 3. **Dependency chaining** — POST to create, then GET/PUT/DELETE the created resource
339
+ 4. **Bug patterns** — Try non-existent IDs, missing fields, invalid emails
340
+ 5. **Auth workflows** — Login first, use tokens in subsequent requests
341
+ 6. **Security testing** — Try cross-user access, injection payloads
342
+
343
+ ---
344
+
345
+ ## Extending the Training
346
+
347
+ ### Add a new reward signal
348
+
349
+ Edit `rewards.py`:
350
+
351
+ ```python
352
+ def efficiency_reward_fn(completions: list[str], **kwargs) -> list[float]:
353
+ """Reward for concise, focused actions (penalize wasted steps)."""
354
+ rewards = []
355
+ for text in completions:
356
+ action = parse_action(text)
357
+ if action and action.expected_status:
358
+ rewards.append(0.5) # Bonus for predicting expected status
359
+ else:
360
+ rewards.append(0.0)
361
+ return rewards
362
+ ```
363
+
364
+ Then add it to the combined reward in `grpo.py`.
365
+
366
+ ### Add a new baseline agent
367
+
368
+ Edit `agents.py`:
369
+
370
+ ```python
371
+ class CoverageAgent:
372
+ """Agent that prioritizes hitting every endpoint once."""
373
+ name = "coverage"
374
+
375
+ def __init__(self):
376
+ self.tested = set()
377
+ # ...
378
+ ```
379
+
380
+ Then add it to the `AGENTS` dict.
381
+
382
+ ### Use a different model
383
+
384
+ ```bash
385
+ # Qwen 2.5 (smaller, faster)
386
+ python -m training.grpo --model-id Qwen/Qwen2.5-1.5B
387
+
388
+ # Llama 3 (if you have access)
389
+ python -m training.grpo --model-id meta-llama/Llama-3.2-1B
390
+ ```
391
+
392
+ Any HuggingFace causal language model works — just make sure it supports chat templates.
training/__init__.py ADDED
@@ -0,0 +1,10 @@
1
+ """
2
+ Training module for the API Testing Environment.
3
+
4
+ Contains:
5
+ - prompts.py — System prompt, observation formatting, action parsing
6
+ - rewards.py — Reward functions for GRPO (format + environment)
7
+ - agents.py — Baseline agents (random, sequential, smart)
8
+ - grpo.py — GRPO training loop with TRL, HF Hub push, W&B logging
9
+ - evaluate.py — Evaluation / rollout runner (local + remote)
10
+ """
training/agents.py ADDED
@@ -0,0 +1,190 @@
1
+ """
2
+ Baseline agents for the API Testing Environment.
3
+
4
+ Three agents of increasing sophistication:
5
+ 1. RandomAgent — Picks random endpoints/methods (lower bound)
5
+ 2. SequentialAgent — Systematically tests each endpoint in order
6
+ 3. SmartAgent — Chains requests and probes for known bug patterns
8
+ """
9
+
10
+ import random
11
+ import sys
12
+ import os
13
+
14
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
15
+ from models import APITestAction, HTTPMethod
16
+
17
+
18
+ class RandomAgent:
19
+ """Randomly picks endpoints and methods. Baseline for comparison."""
20
+
21
+ name = "random"
22
+
23
+ ENDPOINTS = ["/tasks", "/tasks/1", "/tasks/2", "/tasks/999", "/users", "/users/1", "/auth/login"]
24
+ METHODS = ["GET", "POST", "PUT", "DELETE"]
25
+
26
+ def act(self, observation: dict) -> APITestAction:
27
+ method = random.choice(self.METHODS)
28
+ endpoint = random.choice(self.ENDPOINTS)
29
+ body = None
30
+ headers = {}
31
+
32
+ if method == "POST" and endpoint == "/tasks":
33
+ body = {"title": f"Random task {random.randint(1, 100)}"}
34
+ elif method == "POST" and endpoint == "/auth/login":
35
+ body = {"username": random.choice(["alice", "bob"]), "password": "pass"}
36
+ elif method == "POST" and endpoint == "/users":
37
+ body = {"username": f"user{random.randint(100, 999)}", "email": "test@test.com", "password": "pass"}
38
+ elif method == "PUT":
39
+ endpoint = f"/tasks/{random.randint(1, 5)}"
40
+ body = {"title": "Updated"}
41
+
42
+ return APITestAction(
43
+ method=HTTPMethod(method),  # method is always drawn from METHODS
44
+ endpoint=endpoint,
45
+ headers=headers,
46
+ body=body,
47
+ )
48
+
49
+
50
+ class SequentialAgent:
51
+ """Systematically tests each endpoint with valid requests."""
52
+
53
+ name = "sequential"
54
+
55
+ def __init__(self):
56
+ self.step = 0
57
+
58
+ def act(self, observation: dict) -> APITestAction:
59
+ self.step += 1
60
+ actions = self._get_action_sequence()
61
+ idx = min(self.step - 1, len(actions) - 1)
62
+ return actions[idx]
63
+
64
+ def _get_action_sequence(self) -> list[APITestAction]:
65
+ return [
66
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks", expected_status=200),
67
+ APITestAction(method=HTTPMethod.GET, endpoint="/users", expected_status=200),
68
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks/1", expected_status=200),
69
+ APITestAction(method=HTTPMethod.GET, endpoint="/users/1", expected_status=200),
70
+ APITestAction(method=HTTPMethod.POST, endpoint="/auth/login",
71
+ body={"username": "alice", "password": "password123"}, expected_status=200),
72
+ APITestAction(method=HTTPMethod.POST, endpoint="/tasks",
73
+ body={"title": "Test Task", "description": "Created by baseline"}, expected_status=201),
74
+ APITestAction(method=HTTPMethod.POST, endpoint="/users",
75
+ body={"username": "testuser", "email": "test@example.com", "password": "test123"},
76
+ expected_status=201),
77
+ APITestAction(method=HTTPMethod.PUT, endpoint="/tasks/1",
78
+ body={"title": "Updated Task"}, expected_status=200),
79
+ APITestAction(method=HTTPMethod.DELETE, endpoint="/tasks/5", expected_status=200),
80
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks/999999", expected_status=404),
81
+ APITestAction(method=HTTPMethod.POST, endpoint="/tasks",
82
+ body={"description": "No title"}, expected_status=400),
83
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks",
84
+ query_params={"page": -1, "limit": 10}, expected_status=400),
85
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks",
86
+ query_params={"status": "done"}, expected_status=200),
87
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks",
88
+ query_params={"sort": "title"}, expected_status=200),
89
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks/2", expected_status=200),
90
+ ]
91
+
92
+
93
+ class SmartAgent:
94
+ """Heuristic agent that chains requests and probes for bugs."""
95
+
96
+ name = "smart"
97
+
98
+ def __init__(self):
99
+ self.step = 0
100
+ self.auth_tokens = {}
101
+ self.created_ids = []
102
+
103
+ def act(self, observation: dict) -> APITestAction:
104
+ self.step += 1
105
+
106
+ if isinstance(observation, dict):
107
+ self.auth_tokens = observation.get("auth_tokens", self.auth_tokens)
108
+ ids = observation.get("known_resource_ids", {})
109
+ for rtype, id_list in ids.items():
110
+ for rid in id_list:
111
+ if rid not in self.created_ids:
112
+ self.created_ids.append(rid)
113
+
114
+ actions = self._get_smart_sequence()
115
+ idx = min(self.step - 1, len(actions) - 1)
116
+ return actions[idx]
117
+
118
+ def _get_smart_sequence(self) -> list[APITestAction]:
119
+ alice_token = self.auth_tokens.get("alice", "")
120
+ bob_token = self.auth_tokens.get("bob", "")
121
+ alice_auth = {"Authorization": f"Bearer {alice_token}"} if alice_token else {}
122
+ bob_auth = {"Authorization": f"Bearer {bob_token}"} if bob_token else {}
123
+
124
+ return [
125
+ # Phase 1: Discovery
126
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks", expected_status=200),
127
+ APITestAction(method=HTTPMethod.GET, endpoint="/users", expected_status=200),
128
+ # Phase 2: Authentication
129
+ APITestAction(method=HTTPMethod.POST, endpoint="/auth/login",
130
+ body={"username": "alice", "password": "password123"}, expected_status=200),
131
+ APITestAction(method=HTTPMethod.POST, endpoint="/auth/login",
132
+ body={"username": "bob", "password": "password123"}, expected_status=200),
133
+ # Phase 3: CRUD with auth
134
+ APITestAction(method=HTTPMethod.POST, endpoint="/tasks",
135
+ body={"title": "Alice's task", "description": "Test"},
136
+ headers=alice_auth, expected_status=201),
137
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks/1", headers=alice_auth, expected_status=200),
138
+ # Phase 4: Easy bugs
139
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks/999999", expected_status=404),
140
+ APITestAction(method=HTTPMethod.POST, endpoint="/tasks",
141
+ body={"description": "no title"}, expected_status=400),
142
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks",
143
+ query_params={"page": -1, "limit": 10}, expected_status=400),
144
+ # Phase 5: Medium bugs
145
+ APITestAction(method=HTTPMethod.PUT, endpoint="/tasks/1",
146
+ body={"assignee_email": "not-an-email"}, expected_status=422),
147
+ APITestAction(method=HTTPMethod.DELETE, endpoint="/tasks/99999", expected_status=404),
148
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks",
149
+ query_params={"limit": 999999}, expected_status=200),
150
+ # Phase 6: User bugs
151
+ APITestAction(method=HTTPMethod.POST, endpoint="/users",
152
+ body={"username": "baduser", "email": "invalid-email", "password": "test"},
153
+ expected_status=422),
154
+ APITestAction(method=HTTPMethod.POST, endpoint="/auth/login",
155
+ body={"username": "alice", "password": ""}, expected_status=401),
156
+ # Phase 7: BOLA
157
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks/1",
158
+ headers=bob_auth, expected_status=403),
159
+ # Phase 8: Injection
160
+ APITestAction(method=HTTPMethod.POST, endpoint="/tasks",
161
+ body={"title": "test'; DROP TABLE tasks;--"}, expected_status=201),
162
+ APITestAction(method=HTTPMethod.POST, endpoint="/tasks",
163
+ body={"title": "A" * 6000}, expected_status=400),
164
+ # Phase 9: Cross-user modification
165
+ APITestAction(method=HTTPMethod.PUT, endpoint="/tasks/1",
166
+ body={"title": "Bob modified Alice's task"},
167
+ headers=bob_auth, expected_status=403),
168
+ # Phase 10: State consistency
169
+ APITestAction(method=HTTPMethod.POST, endpoint="/tasks",
170
+ body={"title": "Ephemeral task"}, expected_status=201),
171
+ APITestAction(method=HTTPMethod.DELETE, endpoint="/tasks/6", expected_status=200),
172
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks/6", expected_status=404),
173
+ # Phase 11: Coverage
174
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks",
175
+ query_params={"status": "done"}, expected_status=200),
176
+ APITestAction(method=HTTPMethod.GET, endpoint="/tasks",
177
+ query_params={"sort": "title"}, expected_status=200),
178
+ APITestAction(method=HTTPMethod.GET, endpoint="/users/2", expected_status=200),
179
+ # Phase 12: Password hash check
180
+ APITestAction(method=HTTPMethod.POST, endpoint="/users",
181
+ body={"username": "newuser2", "email": "valid@email.com", "password": "pass"},
182
+ expected_status=201),
183
+ ]
184
+
185
+
186
+ AGENTS = {
187
+ "random": RandomAgent,
188
+ "sequential": SequentialAgent,
189
+ "smart": SmartAgent,
190
+ }
training/evaluate.py ADDED
@@ -0,0 +1,318 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Evaluation and rollout runner.
4
+
5
+ - run_rollout(): Run a single episode with a HuggingFace model
6
+ - run_baseline_local(): Run baseline agents against the local environment
7
+ - run_episode(): Run one baseline episode against a remote server
8
+ - main(): CLI for running baselines
9
+ """
10
+
11
+ import argparse
12
+ import asyncio
13
+ import logging
14
+ import random
15
+ import sys
16
+ import os
17
+
18
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
19
+
20
+ logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
21
+ logger = logging.getLogger(__name__)
22
+
23
+ from models import APITestAction, HTTPMethod
24
+ from server.environment import APITestEnvironment
25
+ from .prompts import (
26
+ PLAN_SYSTEM_PROMPT, format_plan_prompt,
27
+ parse_action, parse_test_plan,
28
+ )
29
+ from .agents import AGENTS
30
+
31
+
32
+ def run_rollout(
33
+ model,
34
+ tokenizer,
35
+ task_id: str = "basic_validation",
36
+ seed: int = 42,
37
+ max_steps: int | None = None,
38
+ ) -> dict:
39
+ """Run a single episode with a HuggingFace model.
40
+
41
+ Uses PLAN mode: the model generates a full test plan (JSON array) in one shot,
42
+ then all actions are executed sequentially. This matches how training works.
43
+
44
+ Falls back to multi-turn mode if the model can't produce a valid plan.
45
+ """
46
+ import torch
47
+ import time as _time
48
+
49
+ # Force GPU if available
50
+ if torch.cuda.is_available():
51
+ device = torch.device("cuda")
52
+ # Move model to GPU if it's on CPU
53
+ if next(model.parameters()).device.type == "cpu":
54
+ logger.info(" Moving model to GPU...")
55
+ model = model.to(device)
56
+ else:
57
+ device = next(model.parameters()).device
58
+
59
+ env = APITestEnvironment()
60
+ obs = env.reset(seed=seed, task_id=task_id)
61
+ actual_max = max_steps or obs.max_steps
62
+ logger.info(f" Rollout: {task_id} | max_steps={actual_max} | device={device}")
63
+
64
+ # --- Try plan mode first (matches training) ---
65
+ plan_prompt = format_plan_prompt(obs)
66
+ messages = [
67
+ {"role": "system", "content": PLAN_SYSTEM_PROMPT},
68
+ {"role": "user", "content": plan_prompt},
69
+ ]
70
+
71
+ # Qwen3 thinking support
72
+ chat_kwargs = {}
73
+ if "qwen3" in str(getattr(model, "name_or_path", "") or "").lower():
74
+ chat_kwargs["enable_thinking"] = True
75
+
76
+ prompt_text = tokenizer.apply_chat_template(
77
+ messages, tokenize=False, add_generation_prompt=True, **chat_kwargs,
78
+ )
79
+ inputs = tokenizer(prompt_text, return_tensors="pt").to(device)
80
+
81
+ gen_start = _time.time()
82
+ print(f" Generating test plan...", end="", flush=True)
83
+ with torch.no_grad():
84
+ output = model.generate(
85
+ **inputs,
86
+ max_new_tokens=4096, # Match training max_completion_length
87
+ temperature=0.7,
88
+ do_sample=True,
89
+ pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
90
+ )
91
+ completion = tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
92
+ gen_time = _time.time() - gen_start
93
+ print(f" done ({gen_time:.1f}s, {len(completion)} chars)")
94
+
95
+ # Parse the plan
96
+ actions = parse_test_plan(completion)
97
+ if actions:
98
+ logger.info(f" Plan generated: {len(actions)} actions")
99
+ else:
100
+ # Fallback: try single action parse
101
+ single = parse_action(completion)
102
+ if single:
103
+ actions = [single]
104
+ logger.info(" Plan parse failed, got 1 action from fallback")
105
+ else:
106
+ logger.warning(" Failed to parse any actions from model output")
107
+ # Print first 500 chars of completion for debugging
108
+ preview = completion[:500].replace("\n", " ")
109
+ logger.warning(f" Model output preview: {preview}...")
110
+ actions = []
111
+
112
+ # Limit to max_steps
113
+ actions = actions[:actual_max]
114
+
115
+ # Execute all actions
116
+ total_reward = 0.0
117
+ for i, action in enumerate(actions):
118
+ try:
119
+ obs = env.step(action)
120
+ total_reward += obs.reward or 0.0
121
+ method_str = action.method.value if hasattr(action.method, "value") else str(action.method)
122
+ print(f" Step {i+1}/{len(actions)}: {method_str} {action.endpoint} -> "
123
+ f"{obs.status_code} | reward={obs.reward:.3f} | bugs={obs.bugs_found_so_far}")
124
+ except Exception as e:
125
+ print(f" Step {i+1}/{len(actions)}: ERROR - {e}")
126
+
127
+ # If no actions were generated, show that
128
+ if not actions:
129
+ print(" (no valid actions generated)")
130
+
131
+ state = env.state
132
+ return {
133
+ "task_id": task_id,
134
+ "seed": seed,
135
+ "steps": len(actions),
136
+ "total_reward": round(total_reward, 4),
137
+ "bugs_found": state.bugs_found,
138
+ "total_bugs": state.total_bugs,
139
+ "coverage_pct": state.coverage_pct,
140
+ "bugs_found_ids": state.bugs_found_ids,
141
+ }
142
+
143
+
144
+ def run_baseline_local(
145
+ agent_name: str = "all",
146
+ task_id: str = "all",
147
+ seed: int = 42,
148
+ ) -> list[dict]:
149
+ """Run baseline agents against the local environment (no server needed).
150
+
151
+ Args:
152
+ agent_name: "random", "sequential", "smart", or "all"
153
+ task_id: task ID or "all"
154
+ seed: random seed
155
+
156
+ Returns:
157
+ List of result dicts with agent, task_id, total_reward, bugs_found, etc.
158
+ """
159
+ tasks = ["basic_validation", "edge_cases", "security_workflows"] if task_id == "all" else [task_id]
160
+ agents = list(AGENTS.items()) if agent_name == "all" else [(agent_name, AGENTS[agent_name])]
161
+
162
+ results = []
163
+ for tid in tasks:
164
+ for aname, agent_cls in agents:
165
+ random.seed(seed)
166
+ agent = agent_cls()
167
+ env = APITestEnvironment()
168
+ obs = env.reset(seed=seed, task_id=tid)
169
+
170
+ total_reward = 0.0
171
+ step = 0
172
+
173
+ while not obs.done and step < obs.max_steps:
174
+ obs_dict = {
175
+ "status_code": obs.status_code,
176
+ "response_body": obs.response_body,
177
+ "feedback": obs.feedback,
178
+ "bugs_found_so_far": obs.bugs_found_so_far,
179
+ "coverage_summary": obs.coverage_summary,
180
+ "known_resource_ids": obs.known_resource_ids,
181
+ "auth_tokens": obs.auth_tokens,
182
+ "steps_taken": obs.steps_taken,
183
+ "max_steps": obs.max_steps,
184
+ }
185
+
186
+ action = agent.act(obs_dict)
187
+ obs = env.step(action)
188
+ total_reward += obs.reward or 0.0
189
+ step += 1
190
+
191
+ state = env.state
192
+ result = {
193
+ "agent": aname,
194
+ "task_id": tid,
195
+ "seed": seed,
196
+ "steps": step,
197
+ "total_reward": round(total_reward, 4),
198
+ "bugs_found": state.bugs_found,
199
+ "total_bugs": state.total_bugs,
200
+ "coverage_pct": state.coverage_pct,
201
+ "bugs_found_ids": state.bugs_found_ids,
202
+ }
203
+ results.append(result)
204
+ logger.info(
205
+ f" [{aname}] {tid}: reward={result['total_reward']:.4f}, "
206
+ f"bugs={result['bugs_found']}/{result['total_bugs']}, "
207
+ f"coverage={result['coverage_pct']:.1f}%"
208
+ )
209
+
210
+ return results
211
+
212
+
213
+ # =====================================================================
214
+ # Remote baseline runner (against server via WebSocket client)
215
+ # =====================================================================
216
+
217
+ async def run_episode(url: str, task_id: str, agent_cls, seed: int = 42) -> dict:
218
+ """Run one baseline episode against a remote server."""
219
+ from client import APITestEnv
220
+
221
+ random.seed(seed)
222
+ agent = agent_cls()
223
+
224
+ async with APITestEnv(base_url=url) as env:
225
+ result = await env.reset(task_id=task_id)
226
+ obs = result.observation
227
+
228
+ logger.info(f"Starting {agent.name} agent on task '{task_id}'")
229
+
230
+ total_reward = 0.0
231
+ step = 0
232
+
233
+ while not result.done:
234
+ obs_dict = {
235
+ "status_code": obs.status_code,
236
+ "response_body": obs.response_body,
237
+ "feedback": obs.feedback,
238
+ "bugs_found_so_far": obs.bugs_found_so_far,
239
+ "coverage_summary": obs.coverage_summary,
240
+ "known_resource_ids": obs.known_resource_ids,
241
+ "auth_tokens": obs.auth_tokens,
242
+ "steps_taken": obs.steps_taken,
243
+ "max_steps": obs.max_steps,
244
+ }
245
+
246
+ action = agent.act(obs_dict)
247
+ result = await env.step(action)
248
+ obs = result.observation
249
+ total_reward += result.reward or 0
250
+
251
+ step += 1
252
+ method = action.method.value if hasattr(action.method, "value") else str(action.method)
253
+ logger.info(
254
+ f" Step {step}: {method} {action.endpoint} -> "
255
+ f"{obs.status_code} | reward={result.reward:.4f} | bugs={obs.bugs_found_so_far}"
256
+ )
257
+
258
+ state = await env.state()
259
+ return {
260
+ "task_id": task_id,
261
+ "agent": agent.name,
262
+ "total_reward": round(total_reward, 4),
263
+ "bugs_found": state.bugs_found,
264
+ "total_bugs": state.total_bugs,
265
+ "coverage_pct": state.coverage_pct,
266
+ "steps": step,
267
+ }
268
+
269
+
270
+ async def main_async(args):
271
+ tasks = ["basic_validation", "edge_cases", "security_workflows"] if args.task == "all" else [args.task]
272
+ agents = list(AGENTS.values()) if args.agent == "all" else [AGENTS[args.agent]]
273
+
274
+ results = []
275
+ for task_id in tasks:
276
+ for agent_cls in agents:
277
+ try:
278
+ result = await run_episode(args.url, task_id, agent_cls, seed=args.seed)
279
+ results.append(result)
280
+ logger.info(
281
+ f"\nRESULT: {result['agent']} on {result['task_id']}: "
282
+ f"reward={result['total_reward']}, bugs={result['bugs_found']}/{result['total_bugs']}, "
283
+ f"coverage={result['coverage_pct']:.1f}%"
284
+ )
285
+ except Exception as e:
286
+ logger.error(f"Error running {agent_cls.name} on {task_id}: {e}", exc_info=True)
287
+
288
+ if results:
289
+ print("\n" + "=" * 80)
290
+ print("BASELINE RESULTS SUMMARY")
291
+ print("=" * 80)
292
+ print(f"{'Agent':<15} {'Task':<25} {'Score':<10} {'Bugs':<10} {'Coverage':<10}")
293
+ print("-" * 80)
294
+ for r in results:
295
+ print(
296
+ f"{r['agent']:<15} {r['task_id']:<25} "
297
+ f"{r['total_reward']:<10.4f} "
298
+ f"{r['bugs_found']}/{r['total_bugs']:<8} "
299
+ f"{r['coverage_pct']:<10.1f}%"
300
+ )
301
+ print("=" * 80)
302
+
303
+ return results
304
+
305
+
306
+ def main():
307
+ parser = argparse.ArgumentParser(description="Baseline agents for API Testing Environment")
308
+ parser.add_argument("--url", default="http://localhost:8000", help="Environment server URL")
309
+ parser.add_argument("--task", default="all",
310
+ choices=["basic_validation", "edge_cases", "security_workflows", "all"])
311
+ parser.add_argument("--agent", default="all", choices=["random", "sequential", "smart", "all"])
312
+ parser.add_argument("--seed", type=int, default=42)
313
+ args = parser.parse_args()
314
+ asyncio.run(main_async(args))
315
+
316
+
317
+ if __name__ == "__main__":
318
+ main()
training/grpo.py ADDED
@@ -0,0 +1,783 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ GRPO Training Script for the API Testing Environment.
4
+
5
+ Trains a small LLM (Qwen 1.7B) to become an intelligent API tester
6
+ using Group Relative Policy Optimization (GRPO).
7
+
8
+ The environment IS the dataset — each reset(seed=N) creates a unique
9
+ episode with different users, tasks, and data. No external dataset needed.
10
+
11
+ Features:
12
+ - Auto-push trained model weights to HuggingFace Hub
13
+ - Weights & Biases logging for metrics, loss, rewards
14
+ - Baseline agent evaluation before GRPO (random, sequential, smart)
15
+ - Base model evaluation before GRPO for comparison
16
+ - Post-training evaluation with delta reporting
17
+ - Saves metrics, comparison tables, and plots to output dir
18
+
19
+ Usage:
20
+ # Quick test (CPU, 2 minutes)
21
+ python -m training.grpo --test-mode
22
+
23
+ # Real training (GPU required)
24
+ python -m training.grpo --model-id Qwen/Qwen3-1.7B --num-episodes 100
25
+
26
+ # With HF Hub push
27
+ python -m training.grpo --push-to-hub --hf-repo-id your-username/api-tester-grpo
28
+
29
+ # With Weights & Biases
30
+ python -m training.grpo --use-wandb --wandb-project api-testing-grpo
31
+
32
+ # See what prompts look like (no GPU needed)
33
+ SHOW_PROMPTS=1 python -m training.grpo
34
+
35
+ # Resume from checkpoint
36
+ python -m training.grpo --model-id ./checkpoints/step_50
37
+ """
38
+
39
+ import argparse
40
+ import json
41
+ import logging
42
+ import os
43
+ import sys
44
+ import time
45
+
46
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
47
+
48
+ # --- Suppress noisy HTTP/download logs ---
49
+ logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
50
+ logger = logging.getLogger(__name__)
51
+ for _noisy in ["httpx", "httpcore", "urllib3", "huggingface_hub", "filelock",
52
+ "transformers.configuration_utils", "transformers.modeling_utils"]:
53
+ logging.getLogger(_noisy).setLevel(logging.WARNING)
54
+
55
+ # --- MONKEY PATCH FOR LLM-BLENDER ---
56
+ # llm-blender requires TRANSFORMERS_CACHE which was removed in transformers 4.42+
57
+ try:
58
+ import transformers.utils.hub
59
+ if not hasattr(transformers.utils.hub, "TRANSFORMERS_CACHE"):
60
+ transformers.utils.hub.TRANSFORMERS_CACHE = os.getenv("HF_HOME", os.path.expanduser("~/.cache/huggingface/hub"))
61
+ except ImportError:
62
+ pass
63
+ # ------------------------------------
64
+
65
+ from server.environment import APITestEnvironment
66
+ from .prompts import PLAN_SYSTEM_PROMPT, format_plan_prompt
67
+ from .rewards import format_reward_fn, plan_reward_fn, diversity_reward_fn
68
+ from .evaluate import run_rollout, run_baseline_local
69
+
70
+
71
+ def build_training_prompts(
72
+ num_episodes: int = 50,
73
+ task_ids: list[str] | None = None,
74
+ ) -> list[dict]:
75
+ """Generate training prompts for GRPO plan-based training.
76
+
77
+ Each prompt asks the model to output a COMPLETE TEST PLAN (JSON array of actions).
78
+ The reward function will execute the plan on a fresh environment and score it.
79
+ """
80
+ if task_ids is None:
81
+ task_ids = ["basic_validation", "edge_cases", "security_workflows"]
82
+
83
+ prompts = []
84
+ env = APITestEnvironment()
85
+
86
+ for i in range(num_episodes):
87
+ task_id = task_ids[i % len(task_ids)]
88
+ seed = i * 1000 + 42
89
+
90
+ obs = env.reset(seed=seed, task_id=task_id)
91
+ user_message = format_plan_prompt(obs)
92
+
93
+ prompt_messages = [
94
+ {"role": "system", "content": PLAN_SYSTEM_PROMPT},
95
+ {"role": "user", "content": user_message},
96
+ ]
97
+
98
+ prompts.append({
99
+ "prompt": prompt_messages,
100
+ "task_id": task_id,
101
+ "seed": seed,
102
+ })
103
+
104
+ logger.info(f"Generated {len(prompts)} training prompts across tasks: {task_ids}")
105
+ return prompts
106
+
107
+
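The episode loop above cycles round-robin through the task list while deriving a distinct deterministic seed per episode. A quick sketch of the task/seed assignment for the first five episodes:

```python
# Mirror of build_training_prompts' assignment: task cycles, seed is unique.
task_ids = ["basic_validation", "edge_cases", "security_workflows"]
pairs = [(task_ids[i % len(task_ids)], i * 1000 + 42) for i in range(5)]
print(pairs[0])  # ('basic_validation', 42)
print(pairs[3])  # ('basic_validation', 3042)
```

Episode 0 and episode 3 land on the same task but get different seeds, so repeated tasks still produce distinct environment resets.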
108
+ def run_baseline_evaluation(seed: int = 9999) -> dict:
109
+ """Run all baseline agents and return results for comparison.
110
+
111
+ Returns:
112
+ dict with structure: {agent_name: {task_id: result_dict}}
113
+ """
114
+ logger.info("=" * 60)
115
+ logger.info("Running BASELINE AGENT evaluation...")
116
+ logger.info("=" * 60)
117
+
118
+ results = run_baseline_local(agent_name="all", task_id="all", seed=seed)
119
+
120
+ # Organize by agent -> task
121
+ organized = {}
122
+ for r in results:
123
+ agent = r["agent"]
124
+ if agent not in organized:
125
+ organized[agent] = {}
126
+ organized[agent][r["task_id"]] = r
127
+
128
+ # Print summary table
129
+ print("\n" + "=" * 90)
130
+ print("BASELINE AGENT RESULTS")
131
+ print("=" * 90)
132
+ print(f"{'Agent':<15} {'Task':<25} {'Reward':<10} {'Bugs':<12} {'Coverage':<10}")
133
+ print("-" * 90)
134
+ for agent_name in ["random", "sequential", "smart"]:
135
+ if agent_name not in organized:
136
+ continue
137
+ for task_id in ["basic_validation", "edge_cases", "security_workflows"]:
138
+ r = organized[agent_name].get(task_id, {})
139
+ print(
140
+ f"{agent_name:<15} {task_id:<25} "
141
+ f"{r.get('total_reward', 0):<10.4f} "
142
+ f"{r.get('bugs_found', 0)}/{r.get('total_bugs', 0):<10} "
143
+ f"{r.get('coverage_pct', 0):<10.1f}%"
144
+ )
145
+ print("-" * 90)
146
+ print("=" * 90 + "\n")
147
+
148
+ return organized
149
+
150
+
151
+ def save_metrics(
152
+ output_dir: str,
153
+ baseline_results: dict,
154
+ base_model_results: dict,
155
+ trained_model_results: dict,
156
+ training_args: dict,
157
+ training_time_s: float,
158
+ ):
159
+ """Save all metrics and comparison data to output_dir/metrics/."""
160
+ metrics_dir = os.path.join(output_dir, "metrics")
161
+ os.makedirs(metrics_dir, exist_ok=True)
162
+
163
+ # Full results JSON
164
+ all_results = {
165
+ "training_args": training_args,
166
+ "training_time_seconds": round(training_time_s, 1),
167
+ "baseline_agents": {},
168
+ "base_model": base_model_results,
169
+ "trained_model": trained_model_results,
170
+ }
171
+
172
+ # Flatten baseline results
173
+ for agent_name, tasks in baseline_results.items():
174
+ all_results["baseline_agents"][agent_name] = {}
175
+ for task_id, r in tasks.items():
176
+ all_results["baseline_agents"][agent_name][task_id] = {
177
+ "total_reward": r.get("total_reward", 0),
178
+ "bugs_found": r.get("bugs_found", 0),
179
+ "total_bugs": r.get("total_bugs", 0),
180
+ "coverage_pct": r.get("coverage_pct", 0),
181
+ }
182
+
183
+ with open(os.path.join(metrics_dir, "results.json"), "w") as f:
184
+ json.dump(all_results, f, indent=2)
185
+
186
+ # Comparison table as markdown
187
+ md_lines = ["# Training Results\n"]
188
+ md_lines.append(f"**Model**: {training_args.get('model_id', 'unknown')}")
189
+ md_lines.append(f"**Training time**: {training_time_s / 60:.1f} minutes")
190
+ md_lines.append(f"**Episodes**: {training_args.get('num_episodes', 0)}")
191
+ md_lines.append(f"**Max steps**: {training_args.get('max_steps', 0)}\n")
192
+
193
+ md_lines.append("## Comparison Table\n")
194
+ md_lines.append("| Agent/Model | Task | Reward | Bugs | Coverage |")
195
+ md_lines.append("|---|---|---|---|---|")
196
+
197
+ # Baselines
198
+ for agent_name in ["random", "sequential", "smart"]:
199
+ if agent_name not in baseline_results:
200
+ continue
201
+ for task_id in ["basic_validation", "edge_cases", "security_workflows"]:
202
+ r = baseline_results[agent_name].get(task_id, {})
203
+ md_lines.append(
204
+ f"| {agent_name} | {task_id} | "
205
+ f"{r.get('total_reward', 0):.4f} | "
206
+ f"{r.get('bugs_found', 0)}/{r.get('total_bugs', 0)} | "
207
+ f"{r.get('coverage_pct', 0):.1f}% |"
208
+ )
209
+
210
+ # Base model
211
+ for task_id in ["basic_validation", "edge_cases", "security_workflows"]:
212
+ r = base_model_results.get(task_id, {})
213
+ md_lines.append(
214
+ f"| **base model** | {task_id} | "
215
+ f"{r.get('total_reward', 0):.4f} | "
216
+ f"{r.get('bugs_found', 0)}/{r.get('total_bugs', 0)} | "
217
+ f"{r.get('coverage_pct', 0):.1f}% |"
218
+ )
219
+
220
+ # Trained model
221
+ for task_id in ["basic_validation", "edge_cases", "security_workflows"]:
222
+ r = trained_model_results.get(task_id, {})
223
+ base = base_model_results.get(task_id, {})
224
+ delta = r.get("total_reward", 0) - base.get("total_reward", 0)
225
+ md_lines.append(
226
+ f"| **GRPO trained** | {task_id} | "
227
+ f"{r.get('total_reward', 0):.4f} ({delta:+.4f}) | "
228
+ f"{r.get('bugs_found', 0)}/{r.get('total_bugs', 0)} | "
229
+ f"{r.get('coverage_pct', 0):.1f}% |"
230
+ )
231
+
232
+ md_lines.append("")
233
+ with open(os.path.join(metrics_dir, "results.md"), "w") as f:
234
+ f.write("\n".join(md_lines))
235
+
236
+ logger.info(f"Metrics saved to {metrics_dir}/")
237
+
238
+
239
+ def save_plots(output_dir: str, baseline_results: dict, base_model_results: dict, trained_model_results: dict):
240
+ """Generate and save comparison plots."""
241
+ try:
242
+ import matplotlib
243
+ matplotlib.use("Agg")
244
+ import matplotlib.pyplot as plt
245
+ import numpy as np
246
+ except ImportError:
247
+ logger.warning("matplotlib not installed — skipping plot generation. pip install matplotlib")
248
+ return
249
+
250
+ plots_dir = os.path.join(output_dir, "metrics", "plots")
251
+ os.makedirs(plots_dir, exist_ok=True)
252
+
253
+ tasks = ["basic_validation", "edge_cases", "security_workflows"]
254
+ task_labels = ["Basic", "Edge Cases", "Security"]
255
+
256
+ # --- Plot 1: Reward comparison bar chart ---
257
+ fig, ax = plt.subplots(figsize=(12, 6))
258
+ x = np.arange(len(tasks))
259
+ width = 0.15
260
+
261
+ agents_to_plot = []
262
+ for agent_name in ["random", "sequential", "smart"]:
263
+ if agent_name in baseline_results:
264
+ rewards = [baseline_results[agent_name].get(t, {}).get("total_reward", 0) for t in tasks]
265
+ agents_to_plot.append((agent_name, rewards))
266
+
267
+ base_rewards = [base_model_results.get(t, {}).get("total_reward", 0) for t in tasks]
268
+ agents_to_plot.append(("Base Model", base_rewards))
269
+
270
+ trained_rewards = [trained_model_results.get(t, {}).get("total_reward", 0) for t in tasks]
271
+ agents_to_plot.append(("GRPO Trained", trained_rewards))
272
+
273
+ colors = ["#95a5a6", "#3498db", "#e67e22", "#9b59b6", "#2ecc71"]
274
+ for i, (name, rewards) in enumerate(agents_to_plot):
275
+ offset = (i - len(agents_to_plot) / 2 + 0.5) * width
276
+ bars = ax.bar(x + offset, rewards, width, label=name, color=colors[i % len(colors)])
277
+ for bar, val in zip(bars, rewards):
278
+ if val > 0.01:
279
+ ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01,
280
+ f"{val:.2f}", ha="center", va="bottom", fontsize=7)
281
+
282
+ ax.set_xlabel("Task")
283
+ ax.set_ylabel("Total Reward")
284
+ ax.set_title("Reward Comparison: Baselines vs Base Model vs GRPO Trained")
285
+ ax.set_xticks(x)
286
+ ax.set_xticklabels(task_labels)
287
+ ax.legend()
288
+ ax.set_ylim(bottom=0)
289
+ plt.tight_layout()
290
+ fig.savefig(os.path.join(plots_dir, "reward_comparison.png"), dpi=150)
291
+ plt.close(fig)
292
+
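The `offset` arithmetic centers each group of bars on its x tick. A small check of the offsets produced for five series (3 baselines + base model + GRPO trained) at `width = 0.15`:

```python
# Offsets are symmetric around 0, so the bar group straddles the tick.
width = 0.15
n_series = 5
offsets = [round((i - n_series / 2 + 0.5) * width, 2) for i in range(n_series)]
print(offsets)  # [-0.3, -0.15, 0.0, 0.15, 0.3]
```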
293
+ # --- Plot 2: Bugs found comparison ---
294
+ fig, ax = plt.subplots(figsize=(12, 6))
295
+ for i, (name, _) in enumerate(agents_to_plot):
296
+ if name in baseline_results:
297
+ bugs = [baseline_results[name].get(t, {}).get("bugs_found", 0) for t in tasks]
298
+ elif name == "Base Model":
299
+ bugs = [base_model_results.get(t, {}).get("bugs_found", 0) for t in tasks]
300
+ else:
301
+ bugs = [trained_model_results.get(t, {}).get("bugs_found", 0) for t in tasks]
302
+ offset = (i - len(agents_to_plot) / 2 + 0.5) * width
303
+ ax.bar(x + offset, bugs, width, label=name, color=colors[i % len(colors)])
304
+
305
+ total_bugs = [base_model_results.get(t, {}).get("total_bugs", 0) or
306
+ trained_model_results.get(t, {}).get("total_bugs", 0) for t in tasks]
307
+ ax.plot(x, total_bugs, "k--", marker="D", label="Total Bugs", linewidth=1.5)
308
+
309
+ ax.set_xlabel("Task")
310
+ ax.set_ylabel("Bugs Found")
311
+ ax.set_title("Bug Discovery: Baselines vs Base Model vs GRPO Trained")
312
+ ax.set_xticks(x)
313
+ ax.set_xticklabels(task_labels)
314
+ ax.legend()
315
+ ax.set_ylim(bottom=0)
316
+ plt.tight_layout()
317
+ fig.savefig(os.path.join(plots_dir, "bugs_comparison.png"), dpi=150)
318
+ plt.close(fig)
319
+
320
+ # --- Plot 3: Coverage comparison ---
321
+ fig, ax = plt.subplots(figsize=(12, 6))
322
+ for i, (name, _) in enumerate(agents_to_plot):
323
+ if name in baseline_results:
324
+ cov = [baseline_results[name].get(t, {}).get("coverage_pct", 0) for t in tasks]
325
+ elif name == "Base Model":
326
+ cov = [base_model_results.get(t, {}).get("coverage_pct", 0) for t in tasks]
327
+ else:
328
+ cov = [trained_model_results.get(t, {}).get("coverage_pct", 0) for t in tasks]
329
+ offset = (i - len(agents_to_plot) / 2 + 0.5) * width
330
+ ax.bar(x + offset, cov, width, label=name, color=colors[i % len(colors)])
331
+
332
+ ax.set_xlabel("Task")
333
+ ax.set_ylabel("Coverage %")
334
+ ax.set_title("API Coverage: Baselines vs Base Model vs GRPO Trained")
335
+ ax.set_xticks(x)
336
+ ax.set_xticklabels(task_labels)
337
+ ax.legend()
338
+ ax.set_ylim(0, 105)
339
+ plt.tight_layout()
340
+ fig.savefig(os.path.join(plots_dir, "coverage_comparison.png"), dpi=150)
341
+ plt.close(fig)
342
+
343
+ logger.info(f"Plots saved to {plots_dir}/")
344
+
345
+
346
+ def train_grpo(args):
347
+ """Run GRPO training with TRL."""
348
+ try:
349
+ from datasets import Dataset
350
+ from peft import LoraConfig
351
+ from transformers import AutoModelForCausalLM, AutoTokenizer
352
+ from trl import GRPOConfig, GRPOTrainer
353
+
354
+ # --- MONKEY PATCH FOR TRL GRPOTrainer ---
355
+ # trl 0.15's `_get_train_sampler` does not accept the `dataset` argument that transformers 4.57+ passes
356
+ import inspect
357
+ if hasattr(GRPOTrainer, "_get_train_sampler"):
358
+ sig = inspect.signature(GRPOTrainer._get_train_sampler)
359
+ if "dataset" not in sig.parameters:
360
+ _old_sampler = GRPOTrainer._get_train_sampler
361
+ def _new_sampler(self, dataset=None, **kwargs):
362
+ return _old_sampler(self)
363
+ GRPOTrainer._get_train_sampler = _new_sampler
364
+ # ----------------------------------------
365
+ except ImportError as e:
366
+ logger.error(
367
+ f"Missing dependency: {e}\n"
368
+ "Install with: pip install trl transformers peft datasets torch"
369
+ )
370
+ sys.exit(1)
371
+
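The signature-compatibility shim above can be exercised without trl installed. A standalone sketch of the same pattern, using a hypothetical `Trainer` class with the old no-argument API:

```python
import inspect

class Trainer:
    # Hypothetical old-API method: accepts no dataset argument.
    def _get_train_sampler(self):
        return "sampler"

# Same shim as in train_grpo: if the installed method does not accept
# `dataset`, wrap it so newer callers that pass one still work.
if "dataset" not in inspect.signature(Trainer._get_train_sampler).parameters:
    _old_sampler = Trainer._get_train_sampler
    def _new_sampler(self, dataset=None, **kwargs):
        return _old_sampler(self)
    Trainer._get_train_sampler = _new_sampler

print(Trainer()._get_train_sampler(dataset=[1, 2, 3]))  # sampler
```

The wrapper simply drops the extra arguments, which is safe here because the old implementation never used them.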
372
+ # --- W&B setup ---
373
+ wandb_run = None
374
+ report_to = "none"
375
+ if args.use_wandb:
376
+ try:
377
+ import wandb
378
+ wandb_run = wandb.init(
379
+ project=args.wandb_project,
380
+ name=args.wandb_run_name or f"grpo-{args.model_id.split('/')[-1]}-{int(time.time())}",
381
+ config={
382
+ "model_id": args.model_id,
383
+ "num_episodes": args.num_episodes,
384
+ "num_generations": args.num_generations,
385
+ "max_steps": args.max_steps,
386
+ "learning_rate": args.learning_rate,
387
+ "batch_size": args.batch_size,
388
+ "max_completion_length": args.max_completion_length,
389
+ "lora_r": 16,
390
+ "lora_alpha": 32,
391
+ },
392
+ )
393
+ report_to = "wandb"
394
+ logger.info(f"W&B initialized: project={args.wandb_project}, run={wandb_run.name}")
395
+ except ImportError:
396
+ logger.warning("wandb not installed — skipping W&B logging. pip install wandb")
397
+ args.use_wandb = False
398
+
399
+ training_args_dict = {
400
+ "model_id": args.model_id,
401
+ "num_episodes": args.num_episodes,
402
+ "num_generations": args.num_generations,
403
+ "max_steps": args.max_steps,
404
+ "learning_rate": args.learning_rate,
405
+ "batch_size": args.batch_size,
406
+ "max_completion_length": args.max_completion_length,
407
+ "output_dir": args.output_dir,
408
+ "test_mode": args.test_mode,
409
+ }
410
+
411
+ # ================================================================
412
+ # PIPELINE OVERVIEW
413
+ # ================================================================
414
+ total_pipeline_steps = 11
415
+ def _step(n, msg):
416
+ bar = "█" * n + "░" * (total_pipeline_steps - n)
417
+ print(f"\n{'='*70}")
418
+ print(f" [{bar}] Step {n}/{total_pipeline_steps}: {msg}")
419
+ print(f"{'='*70}\n")
420
+
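The banner helper renders pipeline progress as a fixed-width bar of filled and empty cells. For example, at step 3 of 11:

```python
# Same bar construction as _step: n filled cells, the rest empty.
total_pipeline_steps = 11
n = 3
bar = "█" * n + "░" * (total_pipeline_steps - n)
line = f"[{bar}] Step {n}/{total_pipeline_steps}"
print(line)  # [███░░░░░░░░] Step 3/11
```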
421
+ # --- Step 1: Run baseline agent evaluation ---
422
+ _step(1, "Running baseline agents (random, sequential, smart)")
423
+ baseline_results = run_baseline_evaluation(seed=9999)
424
+
425
+ if args.use_wandb and wandb_run:
426
+ import wandb
427
+ for agent_name, tasks in baseline_results.items():
428
+ for task_id, r in tasks.items():
429
+ wandb.log({
430
+ f"baseline/{agent_name}/{task_id}/reward": r["total_reward"],
431
+ f"baseline/{agent_name}/{task_id}/bugs": r["bugs_found"],
432
+ f"baseline/{agent_name}/{task_id}/coverage": r["coverage_pct"],
433
+ })
434
+
435
+ # --- Step 2: Load model and tokenizer ---
436
+ _step(2, f"Loading model: {args.model_id}")
437
+ print(" Downloading tokenizer...", flush=True)
438
+ tokenizer = AutoTokenizer.from_pretrained(args.model_id, trust_remote_code=True)
439
+ if tokenizer.pad_token is None:
440
+ tokenizer.pad_token = tokenizer.eos_token
441
+ print(" Tokenizer loaded.", flush=True)
442
+
443
+ import torch
444
+
445
+ # --- Force GPU detection ---
446
+ if torch.cuda.is_available():
447
+ device_map = "auto"
448
+ dtype = torch.bfloat16
449
+ gpu_name = torch.cuda.get_device_name(0)
450
+ gpu_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
451
+ print(f" GPU: {gpu_name} ({gpu_mem:.1f} GB)", flush=True)
452
+ print(f" CUDA version: {torch.version.cuda}", flush=True)
453
+ elif torch.backends.mps.is_available():
454
+ device_map = "auto"
455
+ dtype = torch.float16
456
+ print(" Device: Apple MPS", flush=True)
457
+ else:
458
+ # No usable accelerator detected; falling back to CPU with float32.
459
+ # If a CUDA GPU is installed but undetected, fix the driver rather than train on CPU.
460
+ device_map = None
461
+ dtype = torch.float32
462
+ print(" !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", flush=True)
463
+ print(" !! WARNING: No GPU detected — running on CPU !!", flush=True)
464
+ print(" !! Training will be EXTREMELY slow. !!", flush=True)
465
+ print(" !! Check: python -c 'import torch; print(torch.cuda.is_available())'", flush=True)
466
+ print(" !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", flush=True)
467
+
468
+ print(" Downloading model weights...", flush=True)
469
+ model = AutoModelForCausalLM.from_pretrained(
470
+ args.model_id,
471
+ trust_remote_code=True,
472
+ torch_dtype=dtype,
473
+ device_map=device_map,
474
+ )
475
+
476
+ # Verify model is actually on GPU
477
+ actual_device = next(model.parameters()).device
478
+ param_count = sum(p.numel() for p in model.parameters()) / 1e6
479
+ print(f" Model loaded: {param_count:.0f}M parameters on {actual_device}", flush=True)
480
+
481
+ if torch.cuda.is_available() and actual_device.type != "cuda":
482
+ print(" Model not on GPU — forcing move to CUDA...", flush=True)
483
+ model = model.to("cuda")
484
+ print(f" Moved to: {next(model.parameters()).device}", flush=True)
485
+
486
+ # --- Step 3: Evaluate base model BEFORE training ---
487
+ _step(3, f"Evaluating BASE model (before GRPO, max {args.eval_max_steps} steps/task)")
488
+ base_results = {}
489
+ if not args.skip_eval:
490
+ for task_id in ["basic_validation", "edge_cases", "security_workflows"]:
491
+ result = run_rollout(model, tokenizer, task_id=task_id, seed=9999, max_steps=args.eval_max_steps)
492
+ base_results[task_id] = result
493
+ logger.info(
494
+ f" [BASE] {task_id}: reward={result['total_reward']:.3f}, "
495
+ f"bugs={result['bugs_found']}/{result['total_bugs']}, "
496
+ f"coverage={result['coverage_pct']:.1f}%"
497
+ )
498
+ if args.use_wandb and wandb_run:
499
+ import wandb
500
+ wandb.log({
501
+ f"base_model/{task_id}/reward": result["total_reward"],
502
+ f"base_model/{task_id}/bugs": result["bugs_found"],
503
+ f"base_model/{task_id}/coverage": result["coverage_pct"],
504
+ })
505
+ else:
506
+ logger.info("Skipping base model evaluation (--skip-eval)")
507
+ for task_id in ["basic_validation", "edge_cases", "security_workflows"]:
508
+ base_results[task_id] = {"total_reward": 0, "bugs_found": 0, "total_bugs": 0, "coverage_pct": 0}
509
+
510
+ # --- Step 4: LoRA config ---
511
+ _step(4, "Configuring LoRA adapters")
512
+ lora_config = LoraConfig(
513
+ r=16,
514
+ lora_alpha=32,
515
+ lora_dropout=0.05,
516
+ target_modules=["q_proj", "v_proj"],
517
+ task_type="CAUSAL_LM",
518
+ )
519
+ print(f" LoRA: r=16, alpha=32, targets=q_proj+v_proj", flush=True)
520
+
521
+ # --- Step 5: Generate training prompts ---
522
+ _step(5, f"Generating {args.num_episodes} training episodes")
523
+ raw_prompts = build_training_prompts(num_episodes=args.num_episodes)
524
+ print(f" {len(raw_prompts)} prompts across 3 tasks (each with unique seed)", flush=True)
525
+
526
+ # Qwen3 thinking mode: let the model reason before outputting JSON
527
+ # Requires higher max_completion_length (~2048) to fit <think>...</think> + JSON
528
+ chat_template_kwargs = {}
529
+ if "qwen3" in args.model_id.lower():
530
+ chat_template_kwargs["enable_thinking"] = True
531
+ logger.info("Qwen3 detected — thinking mode ENABLED (model will reason before acting)")
532
+
533
+ formatted_prompts = []
534
+ for p in raw_prompts:
535
+ text = tokenizer.apply_chat_template(
536
+ p["prompt"], tokenize=False, add_generation_prompt=True,
537
+ **chat_template_kwargs,
538
+ )
539
+ formatted_prompts.append({"prompt": text, "task_id": p["task_id"], "seed": p["seed"]})
540
+
541
+ dataset = Dataset.from_list(formatted_prompts)
542
+
543
+ # Store prompt metadata for the reward function to create fresh envs
544
+ prompts_meta = [{"seed": p["seed"], "task_id": p["task_id"]} for p in raw_prompts]
545
+
546
+ # Combined reward: format (valid JSON array?) + plan (execute all actions) + diversity (varied requests?)
547
+ # Each generation gets a FRESH environment — no shared state pollution
548
+ def combined_reward_fn(completions, **kwargs):
549
+ fmt = format_reward_fn(completions)
550
+ plan = plan_reward_fn(completions, prompts_meta=prompts_meta)
551
+ div = diversity_reward_fn(completions)
552
+ return [f + p + d for f, p, d in zip(fmt, plan, div)]
553
+
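The combined reward is a per-completion elementwise sum of the three component rewards. A tiny sketch with stubbed reward lists for three completions:

```python
# Stub per-completion scores (format, plan, diversity); the combined
# reward sums them position by position, one total per completion.
fmt = [0.5, 1.0, 0.0]
plan = [2.0, 0.5, 1.5]
div = [0.25, 0.25, 0.0]
combined = [f + p + d for f, p, d in zip(fmt, plan, div)]
print(combined)  # [2.75, 1.75, 1.5]
```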
554
+ # --- Step 6: GRPO training ---
555
+ _step(6, f"GRPO training ({args.max_steps} steps, {args.num_generations} generations/prompt)")
556
+ config = GRPOConfig(
557
+ output_dir=args.output_dir,
558
+ num_generations=args.num_generations,
559
+ max_completion_length=args.max_completion_length,
560
+ learning_rate=args.learning_rate,
561
+ per_device_train_batch_size=args.batch_size,
562
+ num_train_epochs=1,
563
+ max_steps=args.max_steps,
564
+ logging_steps=5,
565
+ save_steps=50,
566
+ save_total_limit=3,
567
+ report_to=report_to,
568
+ temperature=0.8,
569
+ )
570
+
571
+ trainer = GRPOTrainer(
572
+ model=model,
573
+ args=config,
574
+ reward_funcs=[combined_reward_fn],
575
+ train_dataset=dataset,
576
+ peft_config=lora_config,
577
+ processing_class=tokenizer,
578
+ )
579
+
580
+ print(f" Config: lr={args.learning_rate}, batch={args.batch_size}, "
581
+ f"max_completion={args.max_completion_length}, temp=0.8", flush=True)
582
+ print(f" Rewards: format_reward + plan_reward + diversity_reward", flush=True)
583
+ print(f" Training begins... (progress bar below)\n", flush=True)
584
+
585
+ train_start = time.time()
586
+ trainer.train()
587
+ training_time = time.time() - train_start
588
+ print(f"\n Training completed in {training_time / 60:.1f} minutes", flush=True)
589
+
590
+ # --- Step 7: Save model locally ---
591
+ _step(7, f"Saving model to {args.output_dir}")
592
+ trainer.save_model(args.output_dir)
593
+ tokenizer.save_pretrained(args.output_dir)
594
+ print(f" Model + tokenizer saved.", flush=True)
595
+
596
+ # --- Step 8: Push to HuggingFace Hub ---
597
+ _step(8, "Pushing to HuggingFace Hub" if args.push_to_hub else "HF Hub push (skipped — use --push-to-hub)")
598
+ if args.push_to_hub:
599
+ hf_repo = args.hf_repo_id
600
+ if not hf_repo:
601
+ logger.error("--hf-repo-id is required when using --push-to-hub")
602
+ else:
603
+ try:
604
+ logger.info(f"Pushing model to HuggingFace Hub: {hf_repo}")
605
+ trainer.push_to_hub(repo_id=hf_repo, commit_message="GRPO trained API testing agent")
606
+ tokenizer.push_to_hub(repo_id=hf_repo, commit_message="GRPO trained API testing agent")
607
+ logger.info(f"Model pushed to https://huggingface.co/{hf_repo}")
608
+ except Exception as e:
609
+ logger.error(f"Failed to push to HF Hub: {e}")
610
+ logger.info("Make sure you're logged in: huggingface-cli login")
611
+
612
+ # --- Step 9: Evaluate AFTER training ---
613
+ _step(9, f"Evaluating TRAINED model (max {args.eval_max_steps} steps/task)")
614
+ trained_results = {}
615
+ if not args.skip_eval:
616
+ for task_id in ["basic_validation", "edge_cases", "security_workflows"]:
617
+ result = run_rollout(model, tokenizer, task_id=task_id, seed=9999, max_steps=args.eval_max_steps)
618
+ trained_results[task_id] = result
619
+ base = base_results[task_id]
620
+ reward_delta = result["total_reward"] - base.get("total_reward", 0)
621
+ bug_delta = result["bugs_found"] - base.get("bugs_found", 0)
622
+ cov_delta = result["coverage_pct"] - base.get("coverage_pct", 0)
623
+ logger.info(
624
+ f" [TRAINED] {task_id}: reward={result['total_reward']:.3f} ({reward_delta:+.3f}), "
625
+ f"bugs={result['bugs_found']}/{result['total_bugs']} ({bug_delta:+d}), "
626
+ f"coverage={result['coverage_pct']:.1f}% ({cov_delta:+.1f}%)"
627
+ )
628
+ if args.use_wandb and wandb_run:
629
+ import wandb
630
+ wandb.log({
631
+ f"trained_model/{task_id}/reward": result["total_reward"],
632
+ f"trained_model/{task_id}/bugs": result["bugs_found"],
633
+ f"trained_model/{task_id}/coverage": result["coverage_pct"],
634
+ f"delta/{task_id}/reward": reward_delta,
635
+ f"delta/{task_id}/bugs": bug_delta,
636
+ f"delta/{task_id}/coverage": cov_delta,
637
+ })
638
+ else:
639
+ logger.info("Skipping trained model evaluation (--skip-eval)")
640
+ for task_id in ["basic_validation", "edge_cases", "security_workflows"]:
641
+ trained_results[task_id] = {"total_reward": 0, "bugs_found": 0, "total_bugs": 0, "coverage_pct": 0}
642
+
643
+ # --- Step 10: Print final comparison table ---
644
+ _step(10, "Results comparison table")
645
+ print("=" * 95)
646
+ print("FINAL COMPARISON: All Agents & Models")
647
+ print("=" * 95)
648
+ print(f"{'Agent/Model':<18} {'Task':<25} {'Reward':<10} {'Bugs':<12} {'Coverage':<10}")
649
+ print("-" * 95)
650
+
651
+ for agent_name in ["random", "sequential", "smart"]:
652
+ if agent_name in baseline_results:
653
+ for task_id in ["basic_validation", "edge_cases", "security_workflows"]:
654
+ r = baseline_results[agent_name].get(task_id, {})
655
+ print(
656
+ f"{agent_name:<18} {task_id:<25} "
657
+ f"{r.get('total_reward', 0):<10.4f} "
658
+ f"{r.get('bugs_found', 0)}/{r.get('total_bugs', 0):<10} "
659
+ f"{r.get('coverage_pct', 0):<10.1f}%"
660
+ )
661
+ print("-" * 95)
662
+
663
+ for task_id in ["basic_validation", "edge_cases", "security_workflows"]:
664
+ r = base_results[task_id]
665
+ print(
666
+ f"{'Base Model':<18} {task_id:<25} "
667
+ f"{r['total_reward']:<10.4f} "
668
+ f"{r['bugs_found']}/{r['total_bugs']:<10} "
669
+ f"{r['coverage_pct']:<10.1f}%"
670
+ )
671
+ print("-" * 95)
672
+
673
+ for task_id in ["basic_validation", "edge_cases", "security_workflows"]:
674
+ r = trained_results[task_id]
675
+ base = base_results[task_id]
676
+ delta = r["total_reward"] - base["total_reward"]
677
+ print(
678
+ f"{'GRPO Trained':<18} {task_id:<25} "
679
+ f"{r['total_reward']:<10.4f} "
680
+ f"{r['bugs_found']}/{r['total_bugs']:<10} "
681
+ f"{r['coverage_pct']:<10.1f}% ({delta:+.4f})"
682
+ )
683
+ print("=" * 95)
684
+
685
+ # --- Step 11: Save metrics & plots ---
686
+ _step(11, "Saving metrics, plots, and finalizing")
687
+ save_metrics(
688
+ output_dir=args.output_dir,
689
+ baseline_results=baseline_results,
690
+ base_model_results=base_results,
691
+ trained_model_results=trained_results,
692
+ training_args=training_args_dict,
693
+ training_time_s=training_time,
694
+ )
695
+ save_plots(
696
+ output_dir=args.output_dir,
697
+ baseline_results=baseline_results,
698
+ base_model_results=base_results,
699
+ trained_model_results=trained_results,
700
+ )
701
+
702
+ # --- Finalize W&B ---
703
+ if args.use_wandb and wandb_run:
704
+ import wandb
705
+ # Log plots as artifacts
706
+ plots_dir = os.path.join(args.output_dir, "metrics", "plots")
707
+ if os.path.exists(plots_dir):
708
+ for fname in os.listdir(plots_dir):
709
+ if fname.endswith(".png"):
710
+ wandb.log({f"plots/{fname.replace('.png', '')}": wandb.Image(os.path.join(plots_dir, fname))})
711
+ wandb.finish()
712
+
713
+ # ================================================================
714
+ print(f"\n{'='*70}")
715
+ print(f" PIPELINE COMPLETE")
716
+ print(f" Training time: {training_time / 60:.1f} minutes")
717
+ print(f" Model saved to: {args.output_dir}")
718
+ print(f" Metrics: {args.output_dir}/metrics/")
719
+ print(f" Plots: {args.output_dir}/metrics/plots/")
720
+ if args.use_wandb:
721
+ print(f" W&B: https://wandb.ai/{args.wandb_project}")
722
+ if args.push_to_hub and args.hf_repo_id:
723
+ print(f" HF Hub: https://huggingface.co/{args.hf_repo_id}")
724
+ print(f"{'='*70}\n")
725
+
726
+
727
+ def main():
728
+ parser = argparse.ArgumentParser(description="GRPO Training for API Testing Agent")
729
+
730
+ # Model & training
731
+ parser.add_argument("--model-id", default="Qwen/Qwen3-1.7B", help="Base model to fine-tune")
732
+ parser.add_argument("--output-dir", default="./checkpoints/grpo_api_tester")
733
+ parser.add_argument("--num-episodes", type=int, default=50, help="Number of training episodes")
734
+ parser.add_argument("--num-generations", type=int, default=4, help="GRPO parallel rollouts per prompt")
735
+ parser.add_argument("--max-completion-length", type=int, default=4096,
736
+ help="Max tokens per generation. 4096 needed for Qwen3 thinking + JSON plan")
737
+ parser.add_argument("--max-steps", type=int, default=200, help="Max training steps")
738
+ parser.add_argument("--learning-rate", type=float, default=2e-5)
739
+ parser.add_argument("--batch-size", type=int, default=4)
740
+ parser.add_argument("--test-mode", action="store_true", help="Quick test with tiny config")
741
+
742
+ # HuggingFace Hub
743
+ parser.add_argument("--push-to-hub", action="store_true", help="Push trained model to HF Hub")
744
+ parser.add_argument("--hf-repo-id", type=str, default=None,
745
+ help="HF Hub repo ID (e.g., your-username/api-tester-grpo)")
746
+
747
+ # Evaluation
748
+ parser.add_argument("--skip-eval", action="store_true", help="Skip base/trained model evaluation")
749
+ parser.add_argument("--eval-max-steps", type=int, default=10,
750
+ help="Max steps per task during evaluation (default: 10, reduces eval time)")
751
+
752
+ # Weights & Biases
753
+ parser.add_argument("--use-wandb", action="store_true", help="Enable Weights & Biases logging")
754
+ parser.add_argument("--wandb-project", type=str, default="api-testing-grpo",
755
+ help="W&B project name")
756
+ parser.add_argument("--wandb-run-name", type=str, default=None,
757
+ help="W&B run name (auto-generated if not set)")
758
+
759
+ args = parser.parse_args()
760
+
761
+ if args.test_mode:
762
+ logger.info("=== TEST MODE — quick sanity check ===")
763
+ args.num_episodes = 3
764
+ args.num_generations = 4
765
+ args.batch_size = 2
766
+ args.max_steps = 10
767
+ args.max_completion_length = 2048
768
+
769
+ if os.environ.get("SHOW_PROMPTS"):
770
+ prompts = build_training_prompts(num_episodes=3)
771
+ for p in prompts:
772
+ print(f"\n{'='*60}")
773
+ print(f"Task: {p['task_id']} | Seed: {p['seed']}")
774
+ print(f"{'='*60}")
775
+ for msg in p["prompt"]:
776
+ print(f"[{msg['role']}]: {msg['content'][:300]}...")
777
+ return
778
+
779
+ train_grpo(args)
780
+
781
+
782
+ if __name__ == "__main__":
783
+ main()
training/prompts.py ADDED
@@ -0,0 +1,398 @@
1
+ """
2
+ Prompt formatting and action parsing for LLM-based API testing agents.
3
+
4
+ - SYSTEM_PROMPT: Instructions for the LLM on how to test APIs
5
+ - format_observation(): Converts environment observations into LLM prompts
6
+ - parse_action(): Extracts a single JSON action from LLM text
7
+ - parse_test_plan(): Extracts a JSON array of actions (for GRPO training)
8
+ """
9
+
10
+ import json
11
+ import re
12
+ import sys
13
+ import os
14
+
15
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
16
+ from models import APITestAction, HTTPMethod
17
+
18
+
19
+ # =====================================================================
20
+ # System prompt for multi-turn evaluation (one action at a time)
21
+ # =====================================================================
22
+
23
+ SYSTEM_PROMPT = """\
24
+ You are an expert API security tester. You are testing a REST API for bugs.
25
+
26
+ You will receive:
27
+ - The API specification (available endpoints)
28
+ - Results from your previous requests
29
+ - Coverage and bug discovery progress
30
+
31
+ Your job: find as many bugs as possible by sending HTTP requests.
32
+
33
+ Think step by step about what to test next, then output your action as JSON.
34
+
35
+ RESPOND WITH EXACTLY ONE JSON ACTION per turn:
36
+ ```json
37
+ {
38
+ "method": "GET|POST|PUT|DELETE",
39
+ "endpoint": "/path",
40
+ "headers": {},
41
+ "query_params": {},
42
+ "body": null,
43
+ "expected_status": 200
44
+ }
45
+ ```
46
+
47
+ TESTING STRATEGIES:
48
+ - Test each endpoint with valid inputs first
49
+ - Try invalid inputs (missing fields, wrong types, boundary values)
50
+ - Test with non-existent resource IDs
51
+ - Login as different users and test cross-user access
52
+ - Try SQL injection patterns in text fields
53
+ - Test with very long inputs
54
+ - Chain operations: create -> read -> update -> delete
55
+ """
56
+
57
+
58
+ # =====================================================================
59
+ # System prompt for GRPO training (full test plan in one shot)
60
+ # =====================================================================
61
+
62
+ PLAN_SYSTEM_PROMPT = """\
63
+ You are an expert API security tester. You will receive an API specification and must output a COMPLETE TEST PLAN as a JSON array of HTTP requests to execute in order.
64
+
65
+ Your goal: find as many bugs as possible through systematic testing.
66
+
67
+ OUTPUT FORMAT — a JSON array of actions to execute sequentially:
68
+ ```json
69
+ [
70
+ {"method": "GET", "endpoint": "/tasks", "headers": {}, "query_params": {}, "body": null, "expected_status": 200},
71
+ {"method": "POST", "endpoint": "/auth/login", "headers": {}, "query_params": {}, "body": {"username": "alice", "password": "pass"}, "expected_status": 200},
72
+ ...more actions...
73
+ ]
74
+ ```
75
+
76
+ OUTPUT EXACTLY ONE JSON ARRAY. No other text.
77
+
78
+ TESTING STRATEGY — follow this order:
79
+ 1. DISCOVER: GET /tasks, GET /users to see what exists
80
+ 2. AUTHENTICATE: Login as two different users (POST /auth/login)
81
+ 3. CRUD: POST to create, GET to read, PUT to update, DELETE to remove
82
+ 4. MISSING FIELDS: POST /tasks without required "title" field
83
+ 5. NON-EXISTENT IDs: GET /tasks/999999 (expect 404 β€” if you get 200, that's a bug!)
84
+ 6. BOUNDARY: GET /tasks?page=-1&limit=10 (negative page), GET /tasks?limit=999999 (huge limit)
85
+ 7. INVALID DATA: PUT /tasks/1 with assignee_email="not-an-email"
86
+ 8. SECURITY: Login as user B, then try to GET/PUT/DELETE user A's resources (BOLA test)
87
+ 9. INJECTION: POST /tasks with title containing SQL injection like "'; DROP TABLE tasks;--"
88
+ 10. EMPTY AUTH: POST /auth/login with empty password (should fail but might not)
89
+ 11. DATA LEAKS: POST /users and check if response includes password_hash
90
+ 12. STATE: DELETE a task, then GET it again (should be 404)
91
+ 13. LONG INPUT: POST /tasks with a title of 6000+ characters
92
+
93
+ COMMON BUG PATTERNS TO TEST:
94
+ - API returns 200 with null body instead of 404 for missing resources
95
+ - API returns 500 instead of 400 for invalid input
96
+ - API accepts any password (even empty string) for login
97
+ - Users can access other users' resources (no authorization check)
98
+ - Response includes sensitive fields like password_hash
99
+ - No input length limits (very long strings crash the server)
100
+ - SQL/HTML injection payloads stored without sanitization
101
+ - DELETE returns 200 even for non-existent resources
102
+ - No pagination limit cap (limit=999999 accepted)
103
+
104
+ RULES:
105
+ - Output 15-25 actions
106
+ - Each action MUST have "method" and "endpoint"
107
+ - Vary your requests β€” never repeat the same action
108
+ - Use the usernames from the task description for login
109
+ """
110
+
+
+ def format_observation(obs) -> str:
+     """Convert an observation into a human-readable prompt for the LLM.
+
+     Used in multi-turn evaluation (one action at a time).
+     """
+     parts = []
+
+     if obs.steps_taken == 0:
+         parts.append(f"TASK: {obs.task_description}")
+         parts.append(f"\nSTEPS REMAINING: {obs.max_steps}")
+         parts.append("\nAVAILABLE ENDPOINTS:")
+         for ep in obs.available_endpoints:
+             line = f"  {ep['method']} {ep['path']} - {ep.get('summary', '')}"
+             parts.append(line)
+         parts.append("\nBegin testing. Send your first request as JSON.")
+     else:
+         parts.append(f"STEP {obs.steps_taken}/{obs.max_steps}")
+         parts.append(f"RESPONSE: HTTP {obs.status_code}")
+
+         resp = obs.response_body
+         if isinstance(resp, (dict, list)):
+             resp_str = json.dumps(resp, indent=2)
+             if len(resp_str) > 500:
+                 resp_str = resp_str[:500] + "\n... (truncated)"
+         else:
+             resp_str = str(resp)[:500]
+         parts.append(f"BODY:\n{resp_str}")
+
+         parts.append(f"\nFEEDBACK: {obs.feedback}")
+
+         coverage = obs.coverage_summary
+         parts.append(
+             f"\nPROGRESS: Bugs found: {obs.bugs_found_so_far} | "
+             f"Coverage: {coverage.get('coverage_pct', 0):.0f}% | "
+             f"Endpoints tested: {coverage.get('endpoints_tested', 0)}/{coverage.get('total_endpoints', 0)}"
+         )
+
+         if obs.auth_tokens:
+             parts.append(f"AUTH TOKENS: {list(obs.auth_tokens.keys())}")
+         if obs.known_resource_ids:
+             parts.append(f"CREATED RESOURCES: {dict(obs.known_resource_ids)}")
+
+         parts.append("\nSend your next request as JSON.")
+
+     return "\n".join(parts)
156
+
+
+ def format_plan_prompt(obs) -> str:
+     """Convert the initial observation into a prompt for generating a full test plan.
+
+     Used in GRPO training (model outputs a complete plan in one completion).
+     """
+     parts = []
+     parts.append(f"TASK: {obs.task_description}")
+     parts.append(f"\nYou have {obs.max_steps} actions to find as many bugs as possible.")
+     parts.append("\nAVAILABLE ENDPOINTS:")
+     for ep in obs.available_endpoints:
+         summary = ep.get("summary", "")
+         parts.append(f"  {ep['method']} {ep['path']} - {summary}")
+
+         # Show request body schema if available
+         req_body = ep.get("request_body", {})
+         if req_body:
+             props = req_body.get("properties", {})
+             required = req_body.get("required", [])
+             if props:
+                 fields = []
+                 for fname, finfo in props.items():
+                     req_mark = " (required)" if fname in required else ""
+                     fields.append(f"{fname}: {finfo.get('type', 'any')}{req_mark}")
+                 parts.append(f"    Body: {', '.join(fields)}")
+
+         # Show parameters if available
+         params = ep.get("parameters", [])
+         if params:
+             param_strs = [f"{p['name']}: {p.get('type', 'any')}" for p in params]
+             parts.append(f"    Params: {', '.join(param_strs)}")
+
+     parts.append("\nOutput your complete test plan as a JSON array of actions.")
+     return "\n".join(parts)
190
+
+
+ def parse_action(text: str) -> APITestAction | None:
+     """Parse a single JSON action from LLM output.
+
+     Used in multi-turn evaluation.
+     """
+     # Strip Qwen3 thinking blocks
+     if "</think>" in text:
+         text = text.split("</think>", 1)[-1]
+
+     json_match = re.search(r'\{[^{}]*"method"[^{}]*\}', text, re.DOTALL)
+     if not json_match:
+         json_match = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', text, re.DOTALL)
+         if json_match:
+             json_str = json_match.group(1)
+         else:
+             return None
+     else:
+         json_str = json_match.group(0)
+
+     try:
+         data = json.loads(json_str)
+     except json.JSONDecodeError:
+         return None
+
+     return _dict_to_action(data)
216
+
+
+ def parse_test_plan(text: str) -> list[APITestAction]:
+     """Parse a JSON array of actions from LLM output.
+
+     Handles all of these formats:
+     1. Raw JSON array: [{"method": ...}, ...]
+     2. Wrapped object: {"actions": [...]} or {"plan": [...]} or {"test_plan": [...]}
+     3. Markdown code block: ```json [...] ```
+     4. Trailing commas, missing commas (best-effort repair)
+     5. Brace-balanced extraction of individual action objects
+     """
+     if not text:
+         return []
+
+     # Strip Qwen3 thinking blocks
+     if "</think>" in text:
+         text = text.split("</think>", 1)[-1]
+
+     # Strip markdown code fences
+     text = re.sub(r'```(?:json)?\s*', '', text)
+     text = text.replace('```', '')
+
+     data = None
+
+     # Strategy 1: Try to parse the entire text as JSON
+     try:
+         data = json.loads(text.strip())
+     except json.JSONDecodeError:
+         pass
+
+     # Strategy 2: Find a top-level JSON ARRAY via bracket matching
+     if data is None:
+         start = text.find('[')
+         if start >= 0:
+             depth = 0
+             for i in range(start, len(text)):
+                 if text[i] == '[':
+                     depth += 1
+                 elif text[i] == ']':
+                     depth -= 1
+                     if depth == 0:
+                         candidate = text[start:i + 1]
+                         try:
+                             data = json.loads(candidate)
+                             break
+                         except json.JSONDecodeError:
+                             # Repair trailing commas before ] or }
+                             cleaned = re.sub(r',(\s*[\]}])', r'\1', candidate)
+                             try:
+                                 data = json.loads(cleaned)
+                                 break
+                             except json.JSONDecodeError:
+                                 pass
+
+     # Strategy 2b: Find a top-level JSON OBJECT (might be {"actions": [...]})
+     if data is None:
+         start = text.find('{')
+         if start >= 0:
+             depth = 0
+             for i in range(start, len(text)):
+                 if text[i] == '{':
+                     depth += 1
+                 elif text[i] == '}':
+                     depth -= 1
+                     if depth == 0:
+                         candidate = text[start:i + 1]
+                         try:
+                             parsed = json.loads(candidate)
+                             # Only accept if it's a wrapper containing actions
+                             if isinstance(parsed, dict) and any(
+                                 k in parsed for k in ("actions", "plan", "test_plan", "steps", "requests")
+                             ):
+                                 data = parsed
+                                 break
+                         except json.JSONDecodeError:
+                             cleaned = re.sub(r',(\s*[\]}])', r'\1', candidate)
+                             try:
+                                 parsed = json.loads(cleaned)
+                                 if isinstance(parsed, dict) and any(
+                                     k in parsed for k in ("actions", "plan", "test_plan", "steps", "requests")
+                                 ):
+                                     data = parsed
+                                     break
+                             except json.JSONDecodeError:
+                                 pass
+
+     # Strategy 3: Extract individual {"method": ...} objects with brace balancing
+     if data is None:
+         objects = []
+         i = 0
+         while i < len(text):
+             if text[i] == '{':
+                 depth = 1
+                 start = i
+                 i += 1
+                 while i < len(text) and depth > 0:
+                     if text[i] == '{':
+                         depth += 1
+                     elif text[i] == '}':
+                         depth -= 1
+                     i += 1
+                 obj_str = text[start:i]
+                 if '"method"' in obj_str:
+                     try:
+                         obj = json.loads(obj_str)
+                         objects.append(obj)
+                     except json.JSONDecodeError:
+                         cleaned = re.sub(r',(\s*[\]}])', r'\1', obj_str)
+                         try:
+                             obj = json.loads(cleaned)
+                             objects.append(obj)
+                         except json.JSONDecodeError:
+                             pass
+             else:
+                 i += 1
+         if objects:
+             data = objects
+
+     if data is None:
+         return []
+
+     # Unwrap common container shapes: {"actions": [...]}, {"plan": [...]}, etc.
+     if isinstance(data, dict):
+         for key in ("actions", "plan", "test_plan", "steps", "requests"):
+             if key in data and isinstance(data[key], list):
+                 data = data[key]
+                 break
+         else:
+             # Single action object
+             data = [data]
+
+     if not isinstance(data, list):
+         data = [data]
+
+     actions = []
+     for item in data:
+         if isinstance(item, dict) and "method" in item:
+             action = _dict_to_action(item)
+             if action:
+                 actions.append(action)
+
+     return actions
358
+
+
+ def _dict_to_action(data: dict) -> APITestAction | None:
+     """Convert a dict to an APITestAction."""
+     method = str(data.get("method", "GET")).upper()
+     if method not in ("GET", "POST", "PUT", "DELETE", "PATCH"):
+         method = "GET"
+
+     endpoint = data.get("endpoint", "/tasks")
+     if not isinstance(endpoint, str):
+         endpoint = str(endpoint)
+     if not endpoint.startswith("/"):
+         endpoint = "/" + endpoint
+
+     headers = data.get("headers") or {}
+     if not isinstance(headers, dict):
+         headers = {}
+
+     query_params = data.get("query_params") or {}
+     if not isinstance(query_params, dict):
+         query_params = {}
+
+     body = data.get("body")
+     if body is not None and not isinstance(body, dict):
+         body = None
+
+     expected = data.get("expected_status")
+     if expected is not None:
+         try:
+             expected = int(expected)
+         except (ValueError, TypeError):
+             expected = None
+
+     return APITestAction(
+         method=HTTPMethod(method),
+         endpoint=endpoint,
+         headers=headers,
+         query_params=query_params,
+         body=body,
+         expected_status=expected,
+     )
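The bracket-balancing fallback in `parse_test_plan` (Strategy 2) is the piece that rescues most near-miss model outputs. As a standalone sketch of that single idea (a hypothetical mini-parser for illustration, not the module itself), it works like this:

```python
import json
import re


def extract_json_array(text: str) -> list:
    """Best-effort extraction of the first top-level JSON array in text.

    Scan for '[', track bracket depth, and json.loads the balanced
    candidate, stripping trailing commas if the first parse fails.
    """
    start = text.find('[')
    if start < 0:
        return []
    depth = 0
    for i in range(start, len(text)):
        if text[i] == '[':
            depth += 1
        elif text[i] == ']':
            depth -= 1
            if depth == 0:
                candidate = text[start:i + 1]
                try:
                    return json.loads(candidate)
                except json.JSONDecodeError:
                    # Repair trailing commas like [{"a": 1},]
                    cleaned = re.sub(r',(\s*[\]}])', r'\1', candidate)
                    try:
                        return json.loads(cleaned)
                    except json.JSONDecodeError:
                        return []
    return []


llm_output = 'Here is my plan:\n[{"method": "GET", "endpoint": "/tasks"},]\nDone.'
print(extract_json_array(llm_output))
# [{'method': 'GET', 'endpoint': '/tasks'}]
```

Surrounding chatter and a trailing comma are both tolerated, which matters when the policy model drifts from the strict "one JSON array, no other text" instruction.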
training/rewards.py ADDED
@@ -0,0 +1,209 @@
+ """
2
+ Reward functions for GRPO training (v2 β€” plan-based).
3
+
4
+ The model outputs a FULL TEST PLAN (JSON array of actions).
5
+ Each reward function creates a FRESH environment, executes ALL actions,
6
+ and scores the result.
7
+
8
+ Three reward signals:
9
+ 1. format_reward β€” Valid JSON array with 3+ diverse actions? (+2 / -2)
10
+ 2. plan_reward β€” Execute plan, score on bugs + coverage + efficiency (0 to ~8)
11
+ 3. diversity_reward β€” Variety of methods, endpoints, and request patterns (+0 to +2)
12
+ """
13
+
14
+ import re
15
+ import sys
16
+ import os
17
+
18
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
19
+
20
+ from models import APITestAction, HTTPMethod
21
+ from server.environment import APITestEnvironment
22
+ from .prompts import parse_test_plan
23
+
24
+
25
+ def format_reward_fn(completions: list[str], **kwargs) -> list[float]:
26
+ """Reward for valid JSON test plan format.
27
+
28
+ +2.0 if output has 5+ diverse actions (a real plan)
29
+ +1.0 if output has 3-4 actions (minimal plan)
30
+ +0.0 if output has 1-2 actions (barely valid)
31
+ -2.0 if it can't be parsed at all
32
+
33
+ Also penalizes if all actions are identical.
34
+ """
35
+ rewards = []
36
+ for text in completions:
37
+ actions = parse_test_plan(text)
38
+ if not actions:
39
+ rewards.append(-2.0)
40
+ continue
41
+
42
+ n = len(actions)
43
+
44
+ # Check diversity β€” are the actions actually different?
45
+ unique_pairs = set()
46
+ for a in actions:
47
+ m = a.method.value if hasattr(a.method, "value") else str(a.method)
48
+ ep = re.sub(r'/\d+', '/{id}', a.endpoint)
49
+ unique_pairs.add((m, ep))
50
+
51
+ diversity_ratio = len(unique_pairs) / max(n, 1)
52
+
53
+ if n >= 5 and diversity_ratio >= 0.5:
54
+ rewards.append(2.0)
55
+ elif n >= 3:
56
+ rewards.append(1.0)
57
+ elif n >= 1:
58
+ rewards.append(0.0)
59
+ else:
60
+ rewards.append(-2.0)
61
+
62
+ # Penalty if all actions are the same
63
+ if len(unique_pairs) <= 1 and n > 1:
64
+ rewards[-1] = -1.0
65
+
66
+ return rewards
67
+
+
+ def plan_reward_fn(completions: list[str], **kwargs) -> list[float]:
+     """Execute the full test plan in a FRESH environment and return a balanced score.
+
+     Score components:
+     - Bug discovery: min(bugs_found, 5) * 1.0 (capped at 5.0 to not dominate)
+     - Coverage: (coverage_pct / 100) * 2.0 (up to 2.0)
+     - Efficiency: +0.3 per high-reward step among the first 10 actions
+     - Crash penalty: -0.1 per action that caused a 500 error
+
+     Total range: roughly -2 to +8
+
+     Each completion gets its OWN fresh environment - no state pollution.
+     """
+     prompts_meta = kwargs.get("prompts_meta", [])
+     rewards = []
+
+     for i, text in enumerate(completions):
+         actions = parse_test_plan(text)
+         if not actions:
+             rewards.append(-1.0)
+             continue
+
+         # Get episode seed and task
+         meta = prompts_meta[i % len(prompts_meta)] if prompts_meta else {}
+         seed = meta.get("seed", 42)
+         task_id = meta.get("task_id", "basic_validation")
+
+         # Create a FRESH environment
+         env = APITestEnvironment()
+         env.reset(seed=seed, task_id=task_id)
+
+         # Execute all actions, track results
+         crashes = 0
+         step_rewards = []
+         for action in actions:
+             try:
+                 obs = env.step(action)
+                 step_rewards.append(obs.reward or 0.0)
+                 if obs.status_code >= 500:
+                     crashes += 1
+             except Exception:
+                 step_rewards.append(0.0)
+                 crashes += 1
+
+         state = env.state
+         coverage = state.coverage_pct
+
+         # Component 1: Bug discovery (capped to prevent domination)
+         bug_score = min(state.bugs_found, 5) * 1.0
+
+         # Component 2: Coverage (proportional, up to 2.0)
+         coverage_score = (coverage / 100) * 2.0
+
+         # Component 3: Efficiency - finding bugs early is better
+         early_bug_bonus = 0.0
+         early_steps = step_rewards[:10]
+         for r in early_steps:
+             if r > 0.2:  # High-reward step = likely found a bug
+                 early_bug_bonus += 0.3
+
+         # Component 4: Crash penalty
+         crash_penalty = crashes * -0.1
+
+         # Component 5: Step reward sum (small weight - mainly for gradient signal)
+         step_sum = sum(step_rewards) * 0.2
+
+         total = bug_score + coverage_score + early_bug_bonus + crash_penalty + step_sum
+         rewards.append(round(total, 4))
+
+     return rewards
139
+
+
+ def diversity_reward_fn(completions: list[str], **kwargs) -> list[float]:
+     """Reward for diverse test plans - varied methods, endpoints, and strategies.
+
+     Components:
+     - Method variety: up to +0.5 (using GET/POST/PUT/DELETE)
+     - Endpoint variety: up to +0.5 (testing different endpoints)
+     - Strategy variety: up to +0.5 (auth + invalid input + boundary + injection patterns)
+     - Repetition penalty: up to -0.5
+     """
+     rewards = []
+     for text in completions:
+         actions = parse_test_plan(text)
+         if not actions:
+             rewards.append(0.0)
+             continue
+
+         methods = set()
+         endpoints = set()
+         unique_pairs = set()
+         has_auth = False
+         has_invalid_input = False
+         has_boundary = False
+         has_injection = False
+         has_nonexistent_id = False
+
+         for a in actions:
+             m = a.method.value if hasattr(a.method, "value") else str(a.method)
+             methods.add(m)
+             norm_ep = re.sub(r'/\d+', '/{id}', a.endpoint)
+             endpoints.add(norm_ep)
+             unique_pairs.add((m, norm_ep))
+
+             # Detect testing strategies
+             if a.endpoint == "/auth/login":
+                 has_auth = True
+             if a.body and not a.body.get("title") and m == "POST":
+                 has_invalid_input = True
+             qp = a.query_params or {}
+             if any(isinstance(v, (int, float)) and v < 0 for v in qp.values()):
+                 has_boundary = True
+             if any(isinstance(v, (int, float)) and v > 10000 for v in qp.values()):
+                 has_boundary = True
+             if a.body and any("DROP" in str(v).upper() or "script" in str(v).lower()
+                               for v in a.body.values()):
+                 has_injection = True
+             if re.search(r'/\d{4,}', a.endpoint):
+                 has_nonexistent_id = True
+
+         # Method variety (max 4 methods = +0.5)
+         method_score = min(len(methods) / 4, 1.0) * 0.5
+
+         # Endpoint variety (max 7 endpoints = +0.5)
+         endpoint_score = min(len(endpoints) / 7, 1.0) * 0.5
+
+         # Strategy variety (each strategy = +0.1, max +0.5)
+         strategies = sum([has_auth, has_invalid_input, has_boundary, has_injection, has_nonexistent_id])
+         strategy_score = min(strategies * 0.1, 0.5)
+
+         # Repetition penalty
+         if len(actions) > 0:
+             repeat_count = len(actions) - len(unique_pairs)
+             repetition_penalty = min(repeat_count / len(actions), 1.0) * -0.5
+         else:
+             repetition_penalty = 0.0
+
+         total = method_score + endpoint_score + strategy_score + repetition_penalty
+         rewards.append(round(total, 3))
+
+     return rewards
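To sanity-check the weighting in `diversity_reward_fn`, here is a toy re-computation of the method-variety, endpoint-variety, and repetition components on a small hand-written plan (plain dicts stand in for `APITestAction`; the strategy-detection bonus is omitted for brevity):

```python
import re


def diversity_score(actions: list) -> float:
    """Toy re-computation of the diversity reward's variety components.

    actions: list of {"method": str, "endpoint": str} dicts.
    Mirrors diversity_reward_fn's weights: method variety (max +0.5),
    endpoint variety (max +0.5), repetition penalty (down to -0.5).
    """
    methods = {a["method"] for a in actions}
    # Normalize numeric path segments so /tasks/1 and /tasks/2 count once.
    endpoints = {re.sub(r'/\d+', '/{id}', a["endpoint"]) for a in actions}
    pairs = {(a["method"], re.sub(r'/\d+', '/{id}', a["endpoint"])) for a in actions}

    method_score = min(len(methods) / 4, 1.0) * 0.5
    endpoint_score = min(len(endpoints) / 7, 1.0) * 0.5
    repeats = len(actions) - len(pairs)
    repetition_penalty = min(repeats / len(actions), 1.0) * -0.5 if actions else 0.0
    return round(method_score + endpoint_score + repetition_penalty, 3)


plan = [
    {"method": "GET", "endpoint": "/tasks"},
    {"method": "POST", "endpoint": "/tasks"},
    {"method": "GET", "endpoint": "/tasks/1"},
    {"method": "DELETE", "endpoint": "/tasks/2"},
]
print(diversity_score(plan))  # 0.518: 3/4 methods, 2/7 endpoints, no repeats
```

Note how the `/{id}` normalization keeps the score from being inflated by hammering the same endpoint with different numeric IDs, while a plan that repeats one identical request is pulled toward the -0.5 floor.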
uv.lock ADDED