Mayank022 committed commit bafcc7e · verified · 1 Parent(s): ea4f1cd

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ plots/baseline_comparison_plotly.png filter=lfs diff=lfs merge=lfs -text
+ plots/environment_architecture.png filter=lfs diff=lfs merge=lfs -text
+ plots/environment_state_machine.png filter=lfs diff=lfs merge=lfs -text
+ plots/inference_results_plotly.png filter=lfs diff=lfs merge=lfs -text
+ plots/reward_signal_function.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,120 +1,172 @@
  ---
  title: API Testing Environment
- emoji: 🛡️
- colorFrom: indigo
- colorTo: purple
  sdk: docker
  app_port: 8000
  base_path: /ui/
- pinned: false
  license: mit
  tags:
  - openenv
  ---

- # API Testing Environment for OpenEnv

- An RL environment that trains AI agents to become **automated API security testers** — discovering endpoints, crafting requests, finding vulnerabilities mapped to the **OWASP API Security Top 10**, and generating structured bug bounty reports.

- The agent explores a deliberately buggy Task Management API with 13 planted vulnerabilities across 6 OWASP categories. It earns rewards for coverage, correctness, and bug discovery. At episode end, a security assessment report is auto-generated.

  ---
- ## Why This Matters

- - Every software team tests APIs manually or with hand-written test suites
- - Existing tools (Postman, Schemathesis, OWASP ZAP) require manual test design or brute-force fuzzing
- - Academic research shows RL **outperforms traditional tools** in coverage and fault-finding (ARAT-RL, IEEE/ACM 2023; APIRL, AAAI 2025)
- - This environment provides a standardized RL training ground with **verifiable rewards** — deterministic bug detection, not LLM judges

  ---

- ## OWASP Coverage

- All 13 bugs are mapped to the OWASP API Security Top 10 (2023):

- | OWASP Category | Bugs | Description |
- |---------------|------|-------------|
- | **API1** Broken Object Level Authorization | BUG_TASK_07, BUG_AUTH_01 | Users can access/modify other users' resources |
- | **API2** Broken Authentication | BUG_AUTH_02 | Login succeeds with empty password |
- | **API3** Broken Object Property Level Auth | BUG_USER_02 | Response exposes password_hash field |
- | **API4** Unrestricted Resource Consumption | BUG_TASK_06, BUG_TASK_08 | No pagination cap, long input crashes server |
- | **API8** Security Misconfiguration | BUG_TASK_01-05, BUG_TASK_09, BUG_USER_01 | Wrong status codes, missing validation, stored injection |

  ---

  ## Architecture

- ```
- ┌────────────────────────────────────────────────────────────┐
- │ OpenEnv Server (:8000)                                     │
- │                                                            │
- │ Agent ──action──> environment.py                           │
- │       <──obs────    │                                      │
- │                     ├──> buggy_api/ (in-process FastAPI)   │
- │                     │    └── routes/ (tasks, users, auth)  │
- │                     │    └── database.py (SQLite, reset    │
- │                     │        with seed for randomization)  │
- │                     │                                      │
- │                     ├──> bug_detector.py (13 detectors)    │
- │                     ├──> reward.py (5-signal rewards)      │
- │                     └──> graders.py (scoring + bug report) │
- └────────────────────────────────────────────────────────────┘
- ```

- Each `reset(seed=N)` creates a unique database with different users, tasks, and data, preventing memorization during GRPO training.

  ---
- ## Planted Bugs (13 vulnerabilities)

- | ID | Severity | OWASP | Description |
- |----|----------|-------|-------------|
- | BUG_TASK_01 | Easy | API8 | GET /tasks/{id} returns 200+null for missing task (should be 404) |
- | BUG_TASK_02 | Easy | API8 | POST /tasks without title returns 500 (should be 400) |
- | BUG_TASK_03 | Easy | API8 | GET /tasks?page=-1 returns 200 (should be 400) |
- | BUG_TASK_04 | Medium | API8 | PUT accepts invalid email format without validation |
- | BUG_TASK_05 | Medium | API8 | DELETE returns 200 for non-existent task (should be 404) |
- | BUG_TASK_06 | Medium | API4 | No pagination cap — limit=999999 accepted |
- | BUG_USER_01 | Medium | API8 | POST /users accepts invalid email |
- | BUG_USER_02 | Medium | API3 | POST /users response exposes password_hash |
- | BUG_AUTH_02 | Medium | API2 | Login with empty password succeeds |
- | BUG_TASK_07 | Hard | API1 | BOLA: any user can access any task (no ownership check) |
- | BUG_TASK_08 | Hard | API4 | Long title (>5000 chars) crashes server with 500 |
- | BUG_TASK_09 | Hard | API8 | SQL injection payload stored verbatim |
- | BUG_AUTH_01 | Hard | API1 | User A's token can modify User B's tasks |

  ---
- ## Tasks (3 difficulty levels)

- | Task | Difficulty | Steps | Bugs | Focus |
- |------|-----------|-------|------|-------|
- | basic_validation | Easy | 25 | 3 | CRUD testing, status code verification |
- | edge_cases | Medium | 35 | 9 | Invalid inputs, boundary values, chaining |
- | security_workflows | Hard | 45 | 13 | BOLA, auth bypass, injection, state consistency |

  ---
- ## Reward Function

- Multi-signal partial rewards at each step:

- | Signal | Range | Purpose |
- |--------|-------|---------|
- | **Coverage** | 0.0 - 0.20 | New endpoints, methods, status codes |
- | **Validity** | 0.0 - 0.18 | Well-formed requests, dependency chaining |
- | **Bug discovery** | 0.0 - 0.30 | Severity-scaled: easy=0.10, medium=0.15, hard=0.25 |
- | **Exploration** | 0.0 - 0.05 | Novel action patterns |
- | **Penalty** | -0.08 | Exact duplicate requests |

- Final episode score (0.0 - 1.0) from task-specific grader + auto-generated bug bounty report.

  ---
- ## Bug Bounty Report

- At episode end, the environment auto-generates a structured security assessment report:

  ```
  ## API Security Assessment Report
@@ -123,21 +175,23 @@ At episode end, the environment auto-generates a structured security assessment
  **Critical/Hard:** 0 | **Medium:** 1 | **Low/Easy:** 2

  ### MEDIUM: Login with empty password succeeds
- - **ID:** BUG_AUTH_02
- - **OWASP:** API2:2023 Broken Authentication
- - **Recommendation:** Validate password is non-empty and verify against stored hash

  ### LOW: GET /tasks/{id} returns 200 with null for non-existent task
- - **ID:** BUG_TASK_01
- - **OWASP:** API8:2023 Security Misconfiguration
- - **Recommendation:** Return 404 Not Found for non-existent resources
  ```

  ---

- ## Setup & Usage

- ### Local Development

  ```bash
  cd api_testing_env
@@ -147,11 +201,8 @@ uv sync # or: pip install -e .
  uv run server # or: python -m server.app
  # → http://localhost:8000/ API root + endpoint catalogue
  # → http://localhost:8000/ui Interactive bug-hunting playground
- # → http://localhost:8000/docs OpenAPI/Swagger
  # → http://localhost:8000/reset POST endpoint hit by graders
-
- # Run heuristic baselines (no LLM required)
- python baseline.py --url http://localhost:8000 --task all --agent all
  ```

  ### Docker
@@ -162,206 +213,95 @@ docker run -p 8000:8000 api-testing-env
  curl -X POST http://localhost:8000/reset -H 'Content-Type: application/json' -d '{}'
  ```

- ### Inference (`inference.py`) — SUBMISSION ENTRY POINT

- The script judges run to evaluate this environment. It uses an OpenAI-compatible
- client, makes **one LLM call per task** in plan mode, executes the returned JSON
- action plan against the env, and emits the mandatory `[START] / [STEP] / [END]`
- log lines.
-
- #### Required Environment Variables

  | Variable | Purpose |
- |----------|---------|
  | `API_BASE_URL` | OpenAI-compatible LLM endpoint (default: HuggingFace router) |
  | `MODEL_NAME` | Model identifier to use for inference |
  | `HF_TOKEN` | HuggingFace token (used as API key) |

- #### Run Command (the format judges use)
-
- ```bash
- API_BASE_URL=https://router.huggingface.co/v1 \
- MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
- HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
- python inference.py
- ```
-
- #### Optional — Choose How to Attach to the Environment
-
  ```bash
  # (a) In-process — default, fastest, no Docker
  API_BASE_URL=https://router.huggingface.co/v1 \
  MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
- HF_TOKEN=hf_xxx \
  python inference.py

  # (b) Against a built Docker image
- API_BASE_URL=https://router.huggingface.co/v1 \
- MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
- HF_TOKEN=hf_xxx \
  IMAGE_NAME=api-testing-env:latest \
  python inference.py

  # (c) Against a deployed HuggingFace Space
- API_BASE_URL=https://router.huggingface.co/v1 \
- MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
- HF_TOKEN=hf_xxx \
  ENV_BASE_URL=https://Mayank022-api-testing-env.hf.space \
  python inference.py
  ```

- #### Mandatory Output Format (parsed by the OpenEnv judge)

  ```
  [START] task=basic_validation env=api_testing_env model=meta-llama/Llama-3.3-70B-Instruct
  [STEP] step=1 action=GET_/tasks reward=0.33 done=false error=null
  [STEP] step=2 action=POST_/tasks reward=0.28 done=false error=null
  ...
- [END] success=true steps=21 score=0.820 rewards=0.33,0.28,...
  ```

- Each per-task `score` is normalized to **[0, 1]** as
- `0.7 * (bugs_found / total_bugs) + 0.3 * (coverage_pct / 100)`. Total runtime
- is well under 20 minutes on a 2 vCPU / 8 GB box because there are only 3 LLM
- calls and ~50 in-process API requests.

  ### Deploy to HuggingFace Spaces

  ```bash
- huggingface-cli login
  openenv push --repo-id your-username/api-testing-env
- ```

- Validate after deploy:
-
- ```bash
  curl -X POST https://your-username-api-testing-env.hf.space/reset \
  -H 'Content-Type: application/json' -d '{}'
  # expected: HTTP 200 with the initial observation JSON
  ```

- ### GRPO Training
-
- ```bash
- pip install trl transformers peft torch datasets
-
- # Quick test (CPU)
- python -m training.grpo --test-mode
-
- # Full training (GPU)
- python -m training.grpo \
- --model-id Qwen/Qwen3-1.7B \
- --num-episodes 100 \
- --max-steps 200 \
- --push-to-hub --hf-repo-id your-username/api-tester-grpo \
- --use-wandb --wandb-project api-testing-grpo
- ```
-
- The model outputs a **full test plan** (JSON array of 15-25 actions) in one completion. GRPO optimizes complete testing strategies, not single actions. See [training/README.md](training/README.md) for details.
-
- ### Deploy to HuggingFace Spaces
-
- ```bash
- pip install openenv-core
- openenv push --repo-id your-username/api-testing-env
- ```
-
  ---

- ## Evaluation Results

- We evaluated the environment with **5 different agents** to demonstrate the
- reward signal is meaningful, varied, and learnable. Reproducible with `seed=9999`,
- in-process env mode, plan-based action generation.

- ### Inference Submission (`inference.py`)

- The submission entry point uses **`meta-llama/Llama-3.3-70B-Instruct`** via the
- HuggingFace Inference Router. Generates one structured JSON test plan per task,
- executes 20-25 actions, scores normalized to **[0, 1]**.
-
- ```bash
- HF_TOKEN=hf_xxx python inference.py
- ```

- | Task | Steps | Bugs Found | Score (0-1) |
- |------|-------|-----------|-------------|
- | basic_validation | 21 | strong | **0.82** |
- | edge_cases | 23 | medium | **0.62** |
- | security_workflows | 24 | medium | **0.58** |
- | **Average** | | | **0.67** |
-
- Total runtime: **~10 seconds** (3 LLM calls, ~50 in-process API requests).
- Comfortably under 20 minutes on a 2 vCPU / 8 GB judging box.
-
- ### Heuristic Baselines (`python -m training.evaluate`)
-
- No LLM required — pure Python policies. Used as floor/ceiling reference points.
-
- | Agent | basic_validation | edge_cases | security_workflows |
- |---|---|---|---|
- | `random` (lower bound) | 2.73 | 2.73 | 3.00 |
- | `sequential` (fixed plan) | 4.32 | 4.07 | 3.65 |
- | `smart` (200-line heuristic) | 4.86 | 5.18 | 5.13 |

- The **smart agent has 200+ lines of hand-coded test logic** specifically targeting
- the 13 planted bugs (BOLA, SQL injection, missing fields, etc.). It represents
- the *upper bound a hand-crafted human-designed agent can achieve*.

- ### GRPO-Trained Agent (Self-Improving)
-
- We GRPO fine-tuned `Qwen/Qwen3-1.7B` (1.7B params, with LoRA r=16) for **200 steps**
- against the environment. The training reward function uses the same plan parser as
- `inference.py`. **No human demonstrations, no scripted heuristics — pure RL.**
-
- | | Base Qwen3-1.7B | GRPO Trained (200 steps) | Improvement |
- |---|---|---|---|
- | basic_validation | 0.00 | **3.48** (2/3 bugs, 50% coverage) | **+3.48** |
- | edge_cases | 0.00 | **3.88** (5/9 bugs, 50% coverage) | **+3.88** |
- | security_workflows | 0.00 | **3.16** (1/13 bugs, **70% coverage**) | **+3.16** |
- | **Average reward** | **0.00** | **3.51** | **+3.51** |
- | Training reward (final) | — | **7.00** | (matches wandb run) |
-
- **Trained model weights:** [Mayank022/api-tester-v3](https://huggingface.co/Mayank022/api-tester-v3)
- **W&B training run:** `api-testing-grpo-v3` (200 steps, ~5.8 hours on H100)
-
- #### What this proves
-
- 1. **The base model scored 0.0 on every task** — it couldn't even output valid JSON.
- 2. **After 200 GRPO steps**, the same 1.7B model now generates **22-62 action test plans**,
- discovers real bugs, and reaches **70% coverage** on the hardest task.
- 3. **It learned API testing strategies from scratch** — no demos, no scripts, only
- reward signal from the environment.
- 4. **The gap between trained (3.5) and smart heuristic (5.0)** = room for further
- training. With more steps, larger models, or curriculum learning, this gap closes.
-
- The **environment is the dataset**. Each `reset(seed=N)` produces a unique database
- (different users, tasks, data), so the agent cannot memorize — it must learn
- generalizable testing strategies.
-
- ### Reward Signal Validation
-
- | Metric | Value | What it means |
- |---|---|---|
- | Score range | 0.00 → 5.18 | Wide spread = good signal for RL |
- | Easy bug detection rate | 2-3 / 3 | Reachable in 20 steps |
- | Hard bug detection rate | 1-10 / 13 | Skill-dependent |
- | Reward variance (training) | std=3.2 | Healthy GRPO learning signal |
- | Format reward + plan reward + diversity | 3 signals | Decomposed for clean gradients |

- **For judges:** the score gap between random (2.73), trained (3.51), smart (4.86),
- and Llama 70B (norm 0.82) demonstrates the environment **distinguishes agent skill**
- across orders of magnitude — exactly what the OpenEnv evaluator looks for.

  ---

- ## Project Structure

  ```
  api_testing_env/
  ├── inference.py # SUBMISSION ENTRY POINT — OpenAI client, [START]/[STEP]/[END]
  ├── models.py # APITestAction, APITestObservation, APITestState
- ├── client.py # EnvClient subclass (WebSocket)
  ├── openenv.yaml # OpenEnv manifest
  ├── pyproject.toml # Dependencies (incl. openai, gradio)
  ├── Dockerfile # Container for HuggingFace Spaces
@@ -378,16 +318,13 @@ api_testing_env/
  │ ├── models.py # Pydantic schemas
  │ └── routes/ # tasks.py, users.py, auth.py

- ├── training/ # GRPO TRAINING
- │ ├── prompts.py # System prompts + action parsing
- │ ├── rewards.py # Plan-based reward functions
- │ ├── agents.py # Baseline agents (random/sequential/smart)
- │ ├── grpo.py # GRPO training loop (TRL + LoRA)
- │ └── evaluate.py # Rollout runner + evaluation

- ├── gradio_app.py # Interactive UI dashboard
- ├── baseline.py # Wrapper -> training/evaluate.py
- ├── train_grpo.py # Wrapper -> training/grpo.py
  └── data/tasks.json # Task definitions + bug registry
  ```

@@ -398,6 +335,4 @@ api_testing_env/
  - [OWASP API Security Top 10 (2023)](https://owasp.org/API-Security/)
  - [APIRL: Deep RL for REST API Fuzzing (AAAI 2025)](https://arxiv.org/abs/2412.15991)
  - [ARAT-RL: Adaptive REST API Testing with RL (IEEE/ACM 2023)](https://codingsoo.github.io/publication/2024-adaptive-rest-api-testing-rl)
- - [GRPO: Group Relative Policy Optimization (Shao et al. 2024)](https://arxiv.org/abs/2402.03300)
- - [DeepSeek-R1: Verifiable Rewards for RL (2024)](https://arxiv.org/abs/2401.02954)
  - [OpenEnv Framework](https://meta-pytorch.org/OpenEnv/index.html)
 
  ---
  title: API Testing Environment
+ emoji: 🐞
+ colorFrom: green
+ colorTo: blue
  sdk: docker
  app_port: 8000
  base_path: /ui/
+ pinned: true
  license: mit
+ short_description: RL env training agents to find OWASP API vulnerabilities
  tags:
  - openenv
+ - reinforcement-learning
+ - api-testing
+ - security
+ - owasp
+ - gradio
  ---

+ <h1 align="center">API Testing Environment for OpenEnv</h1>

+ <p align="center">
+ <em>An RL environment that teaches AI agents to find real vulnerabilities in REST APIs.<br/>Real bugs. Real reward signal. Verifiable end to end.</em>
+ </p>

+ <p align="center">
+ <a href="https://huggingface.co/spaces/Mayank022/api-testing-env"><b>Try the live demo →</b></a>
+ </p>
+
+ <p align="center">
+ <a href="#overview">Overview</a> ·
+ <a href="#architecture">Architecture</a> ·
+ <a href="#episode-lifecycle">Lifecycle</a> ·
+ <a href="#reward-function">Reward</a> ·
+ <a href="#owasp-coverage">OWASP</a> ·
+ <a href="#setup--usage">Setup</a> ·
+ <a href="#evaluation-results">Results</a>
+ </p>
+
+ <p align="center">
+ <img src="plots/environment_architecture.png" alt="Environment architecture diagram" width="820">
+ </p>

  ---

+ ## Overview
+
+ The agent connects to a deliberately buggy Task Management API, sends HTTP requests, and earns rewards for hitting endpoints, validating responses, and discovering planted vulnerabilities mapped to the **OWASP API Security Top 10**. At the end of every episode the environment auto-generates a structured bug bounty report.

+ - **13 planted vulnerabilities** across 6 OWASP categories
+ - **3 difficulty tiers**: `basic_validation`, `edge_cases`, `security_workflows`
+ - **5-signal reward function**: verifiable, no LLM judge
+ - **Three attach modes** — in-process Python, Docker container, or deployed HF Space

  ---

+ ## Why this exists

+ - Every team ships APIs and every API has bugs.
+ - The standard tooling (Postman, Schemathesis, OWASP ZAP) needs humans writing tests by hand or falls back to brute-force fuzzing.
+ - Recent academic work shows RL beats both — *APIRL* (AAAI 2025), *ARAT-RL* (IEEE/ACM 2023) — but until now there was no standard RL benchmark for API security testing.

+ This environment fills that gap. It gives an agent a real REST API to attack, a deterministic reward signal, and a structured grading rubric — all the ingredients you need to train policies that generalize.

  ---

  ## Architecture

+ The environment is a single FastAPI process (see the diagram at the top of this README) that wraps three things behind the OpenEnv `step()` / `reset()` / `state()` contract:
+
+ 1. **`buggy_api/`** — an in-process Task Management REST API with seed-randomized data. Every `reset(seed=N)` produces a unique database (different users, tasks, ownership), so agents can't memorize answers between episodes.
+ 2. **`bug_detector.py`** — 13 deterministic detectors, one per planted vulnerability. Each one scans the request/response pair and either fires (bug found) or stays silent. No LLM judge.
+ 3. **`reward.py` + `graders.py`** — combine a 5-signal step reward with a per-task terminal grader. The terminal grader returns a normalized score in `[0, 1]` and a structured OWASP report.
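To make the detector contract concrete, here is a minimal sketch of what one deterministic detector could look like. This is illustrative only — the function name and signature are assumptions, not the repo's actual `bug_detector.py` code. It mirrors `BUG_TASK_01` from the registry: `GET /tasks/{id}` answering `200` with a `null` body instead of `404`:

```python
import re

def detect_bug_task_01(method: str, path: str, status: int, body) -> bool:
    """Fire iff GET /tasks/{id} returned 200 with a null body.

    Deterministic: the same request/response pair always yields the
    same verdict, which is what makes the reward verifiable.
    """
    is_task_by_id = method == "GET" and re.fullmatch(r"/tasks/\d+", path)
    return bool(is_task_by_id) and status == 200 and body is None

# A missing task should have produced 404; 200 + null means the bug fired.
assert detect_bug_task_01("GET", "/tasks/9999", 200, None)
assert not detect_bug_task_01("GET", "/tasks/1", 200, {"id": 1})
```

Because each detector is a pure predicate over the request/response pair, the 13 of them can simply be run in sequence after every `step()`.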

+ Clients can attach in three ways: in-process from Python, against a Docker container (`IMAGE_NAME=api-testing-env:latest`), or against a deployed HuggingFace Space (`ENV_BASE_URL=https://...`). Same `client.py` for all three.

  ---

+ ## Episode lifecycle

+ <p align="center">
+ <img src="plots/environment_state_machine.png" alt="Environment state machine" width="560">
+ </p>
+
+ A typical episode walks through seven states:
+
+ | State | Trigger | What happens |
+ |---|---|---|
+ | **Idle** | Server boots | Waiting for a `reset()` call |
+ | **Initialized** | `reset(seed, task_id)` | Database reseeded, task loaded, action history cleared |
+ | **Stepping** | `step(action)` | Agent sends an HTTP request; observation + step reward returned |
+ | **Detecting** | Bug detector matches | Reward bumped by severity (easy 0.10 / medium 0.15 / hard 0.25), bug ID logged |
+ | **Grading** | `steps_taken == max_steps` | Task-specific grader produces a terminal score in `[0, 1]` |
+ | **Reporting** | Grading complete | Structured bug bounty report attached to the final observation |
+ | **Done** | Episode closed | Ready for the next `reset()` |
+
+ The state machine is the same for every task — only `max_steps`, the seed, and the grader change.
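The lifecycle table can be sketched as a tiny transition map. This is an illustrative reading of the table, not the environment's actual implementation — the state names come from the table, but the dict encoding is an assumption:

```python
from enum import Enum, auto

class EpisodeState(Enum):
    IDLE = auto()
    INITIALIZED = auto()
    STEPPING = auto()
    DETECTING = auto()
    GRADING = auto()
    REPORTING = auto()
    DONE = auto()

# Legal transitions from the lifecycle table; Stepping loops until max_steps.
TRANSITIONS = {
    EpisodeState.IDLE: {EpisodeState.INITIALIZED},            # reset(seed, task_id)
    EpisodeState.INITIALIZED: {EpisodeState.STEPPING},        # first step(action)
    EpisodeState.STEPPING: {EpisodeState.STEPPING,            # next step
                            EpisodeState.DETECTING,           # a detector fired
                            EpisodeState.GRADING},            # steps_taken == max_steps
    EpisodeState.DETECTING: {EpisodeState.STEPPING, EpisodeState.GRADING},
    EpisodeState.GRADING: {EpisodeState.REPORTING},           # terminal score computed
    EpisodeState.REPORTING: {EpisodeState.DONE},              # report attached
    EpisodeState.DONE: {EpisodeState.IDLE},                   # ready for next reset()
}

def can_move(src: EpisodeState, dst: EpisodeState) -> bool:
    """True iff the lifecycle table allows the src -> dst transition."""
    return dst in TRANSITIONS[src]

assert can_move(EpisodeState.IDLE, EpisodeState.INITIALIZED)
assert not can_move(EpisodeState.GRADING, EpisodeState.STEPPING)
```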

  ---

+ ## Reward function

+ <p align="center">
+ <img src="plots/reward_signal_function.png" alt="Reward signal decision tree" width="720">
+ </p>
+
+ Every step the agent takes is run through a decision tree that produces a partial reward in roughly `[-0.08, +0.30]`:
+
+ | Signal | Range | Triggered when |
+ |---|---|---|
+ | **Bug discovery** | `+0.10` / `+0.15` / `+0.25` | A planted bug detector fires, scaled by severity |
+ | **Coverage** | `+0.10` per first hit | The agent reaches a new endpoint for the first time |
+ | **Validity** | `+0.03` / `+0.10` chaining | The request is well-formed; chaining an ID from a prior response gets a bonus |
+ | **Exploration** | `+0.05` | The action pattern (method + endpoint shape + auth state) is novel |
+ | **Penalty (duplicate)** | `−0.08` | The agent re-issued an exact duplicate request |
+ | **Penalty (malformed)** | `−0.05` | The request is structurally invalid |
+
+ When the episode ends, the per-task grader adds a terminal score in `[0, 1]` based on its own criteria — CRUD coverage, dependency chaining, security probing — and emits the final OWASP bug bounty report.
+
+ The whole pipeline is **verifiable**: no LLM-as-judge, no soft heuristics, no ambiguity. Every signal maps to a real OWASP category that judges can audit.
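As a rough sketch, the decision tree above could be folded into one pure function. The flag names, the order of checks, and the cap that keeps the result inside the documented `[-0.08, +0.30]` band are all assumptions for illustration — only the numeric values come from the table:

```python
SEVERITY_BONUS = {"easy": 0.10, "medium": 0.15, "hard": 0.25}

def step_reward(*, bug_severity=None, new_endpoint=False, well_formed=False,
                chained_id=False, novel_pattern=False, duplicate=False,
                malformed=False) -> float:
    """Combine the signals from the reward table into one partial reward."""
    if duplicate:                       # exact duplicate request
        return -0.08
    if malformed:                       # structurally invalid request
        return -0.05
    r = 0.0
    if bug_severity:                    # a planted bug detector fired
        r += SEVERITY_BONUS[bug_severity]
    if new_endpoint:                    # first hit on this endpoint
        r += 0.10
    if well_formed:                     # chained ID from a prior response earns more
        r += 0.10 if chained_id else 0.03
    if novel_pattern:                   # novel method/endpoint/auth combination
        r += 0.05
    # Cap is an assumption, to stay inside the documented [-0.08, +0.30] band.
    return min(r, 0.30)

assert step_reward(duplicate=True) == -0.08
assert abs(step_reward(bug_severity="medium") - 0.15) < 1e-9
```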
124
 
125
+ ## OWASP coverage
126
+
127
+ All 13 bugs are mapped to the OWASP API Security Top 10 (2023):
128
+
129
+ | OWASP Category | Bugs | Description |
130
+ |---|---|---|
131
+ | **API1** Broken Object Level Authorization | `BUG_TASK_07`, `BUG_AUTH_01` | Users can access/modify other users' resources |
132
+ | **API2** Broken Authentication | `BUG_AUTH_02` | Login succeeds with empty password |
133
+ | **API3** Broken Object Property Level Auth | `BUG_USER_02` | Response exposes `password_hash` field |
134
+ | **API4** Unrestricted Resource Consumption | `BUG_TASK_06`, `BUG_TASK_08` | No pagination cap, long input crashes server |
135
+ | **API8** Security Misconfiguration | `BUG_TASK_01-05`, `BUG_TASK_09`, `BUG_USER_01` | Wrong status codes, missing validation, stored injection |
136
 
137
+ ### Full bug registry
138
 
139
+ | ID | Severity | OWASP | Description |
140
+ |---|---|---|---|
141
+ | `BUG_TASK_01` | Easy | API8 | `GET /tasks/{id}` returns `200 + null` for missing task (should be `404`) |
142
+ | `BUG_TASK_02` | Easy | API8 | `POST /tasks` without title returns `500` (should be `400`) |
143
+ | `BUG_TASK_03` | Easy | API8 | `GET /tasks?page=-1` returns `200` (should be `400`) |
144
+ | `BUG_TASK_04` | Medium | API8 | `PUT` accepts invalid email format without validation |
145
+ | `BUG_TASK_05` | Medium | API8 | `DELETE` returns `200` for non-existent task (should be `404`) |
146
+ | `BUG_TASK_06` | Medium | API4 | No pagination cap — `limit=999999` accepted |
147
+ | `BUG_USER_01` | Medium | API8 | `POST /users` accepts invalid email |
148
+ | `BUG_USER_02` | Medium | API3 | `POST /users` response exposes `password_hash` |
149
+ | `BUG_AUTH_02` | Medium | API2 | Login with empty password succeeds |
150
+ | `BUG_TASK_07` | Hard | API1 | BOLA — any user can access any task (no ownership check) |
151
+ | `BUG_TASK_08` | Hard | API4 | Long title (>5000 chars) crashes server with `500` |
152
+ | `BUG_TASK_09` | Hard | API8 | SQL injection payload stored verbatim |
153
+ | `BUG_AUTH_01` | Hard | API1 | User A's token can modify User B's tasks |
154
 
155
+ ---
156
+
157
+ ## Tasks
158
+
159
+ | Task | Difficulty | Steps | Bugs | Focus |
160
+ |---|---|---|---|---|
161
+ | `basic_validation` | Easy | 25 | 3 | CRUD testing, status code verification |
162
+ | `edge_cases` | Medium | 35 | 9 | Invalid inputs, boundary values, ID chaining |
163
+ | `security_workflows` | Hard | 45 | 13 | BOLA, auth bypass, injection, state consistency |
164
 
165
  ---

+ ## Bug bounty report

+ At episode end the environment emits a structured report:

  ```
  ## API Security Assessment Report

  **Critical/Hard:** 0 | **Medium:** 1 | **Low/Easy:** 2

  ### MEDIUM: Login with empty password succeeds
+ - ID: BUG_AUTH_02
+ - OWASP: API2:2023 Broken Authentication
+ - Recommendation: Validate password is non-empty and verify against the stored hash

  ### LOW: GET /tasks/{id} returns 200 with null for non-existent task
+ - ID: BUG_TASK_01
+ - OWASP: API8:2023 Security Misconfiguration
+ - Recommendation: Return 404 Not Found for non-existent resources
  ```

+ The report is part of the final observation, so any downstream pipeline (a research notebook, a CI bot, a dashboard) can consume it without re-parsing logs.
+
  ---

+ ## Setup & usage

+ ### Local development

  ```bash
  cd api_testing_env

  uv run server # or: python -m server.app
  # → http://localhost:8000/ API root + endpoint catalogue
  # → http://localhost:8000/ui Interactive bug-hunting playground
+ # → http://localhost:8000/docs OpenAPI / Swagger
  # → http://localhost:8000/reset POST endpoint hit by graders
  ```

  ### Docker

  curl -X POST http://localhost:8000/reset -H 'Content-Type: application/json' -d '{}'
  ```

+ ### Inference (`inference.py`)
217
 
218
+ The script runs to evaluate this environment. It uses an OpenAI-compatible client, makes **one LLM call per task** in plan mode, executes the returned JSON action plan against the env, and emits the mandatory `[START] / [STEP] / [END]` log lines.
 
 
 
 
 
219
 
220
  | Variable | Purpose |
221
+ |---|---|
222
  | `API_BASE_URL` | OpenAI-compatible LLM endpoint (default: HuggingFace router) |
223
  | `MODEL_NAME` | Model identifier to use for inference |
224
  | `HF_TOKEN` | HuggingFace token (used as API key) |
225
 
 
 
 
 
 
 
 
 
 
 
 
226
  ```bash
227
  # (a) In-process — default, fastest, no Docker
228
  API_BASE_URL=https://router.huggingface.co/v1 \
229
  MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
230
+ HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
231
  python inference.py
232
 
233
  # (b) Against a built Docker image
 
 
 
234
  IMAGE_NAME=api-testing-env:latest \
235
+ HF_TOKEN=hf_xxx \
236
  python inference.py
237
 
238
  # (c) Against a deployed HuggingFace Space
 
 
 
239
  ENV_BASE_URL=https://Mayank022-api-testing-env.hf.space \
240
+ HF_TOKEN=hf_xxx \
241
  python inference.py
242
  ```
243
 
244
+ #### Mandatory output format (parsed by the OpenEnv judge)
245
 
246
  ```
247
  [START] task=basic_validation env=api_testing_env model=meta-llama/Llama-3.3-70B-Instruct
248
  [STEP] step=1 action=GET_/tasks reward=0.33 done=false error=null
249
  [STEP] step=2 action=POST_/tasks reward=0.28 done=false error=null
250
  ...
251
+ [END] success=true steps=21 score=0.82 rewards=0.33,0.28,...
252
  ```
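Since the `[STEP]` lines are plain `key=value` pairs, a downstream consumer can parse them in a few lines. This is a hedged sketch — the judge's actual parser is not published here, so treat the function as illustrative:

```python
import re

STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>-?[\d.]+) done=(?P<done>\w+) error=(?P<error>\S+)"
)

def parse_step(line: str) -> dict:
    """Parse one [STEP] log line into typed fields."""
    m = STEP_RE.match(line)
    if m is None:
        raise ValueError(f"not a [STEP] line: {line!r}")
    d = m.groupdict()
    return {
        "step": int(d["step"]),
        "action": d["action"],
        "reward": float(d["reward"]),
        "done": d["done"] == "true",
        "error": None if d["error"] == "null" else d["error"],
    }

rec = parse_step("[STEP] step=1 action=GET_/tasks reward=0.33 done=false error=null")
assert rec["action"] == "GET_/tasks" and rec["done"] is False
```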

+ Each per-task `score` is normalized to `[0, 1]` as `0.7 * (bugs_found / total_bugs) + 0.3 * (coverage_pct / 100)`. Total runtime is well under 20 minutes on a 2 vCPU / 8 GB box because there are only 3 LLM calls and ~50 in-process API requests.
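The normalization is simple enough to check by hand; the sketch below is a direct transcription of the formula above (the function name is ours):

```python
def normalize_score(bugs_found: int, total_bugs: int, coverage_pct: float) -> float:
    """0.7 weight on bug discovery, 0.3 on coverage; result lands in [0, 1]."""
    return 0.7 * (bugs_found / total_bugs) + 0.3 * (coverage_pct / 100)

# e.g. all 3 bugs found at 40% coverage gives the 0.82 from the sample log:
assert abs(normalize_score(3, 3, 40.0) - 0.82) < 1e-9
```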

  ### Deploy to HuggingFace Spaces

  ```bash
+ huggingface-cli login # or: hf auth login
  openenv push --repo-id your-username/api-testing-env

+ # Validate after deploy
  curl -X POST https://your-username-api-testing-env.hf.space/reset \
  -H 'Content-Type: application/json' -d '{}'
  # expected: HTTP 200 with the initial observation JSON
  ```
267
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
268
  ---
269
 
270
+ ## Evaluation results
271
 
272
+ We ran the environment against **5 different agents** to confirm the reward signal is meaningful, varied, and learnable. All numbers are reproducible with `seed=9999`, in-process env mode, plan-based action generation.
 
 
273
 
274
+ <p align="center">
275
+ <img src="plots/baseline_comparison_matplotlib.png" alt="Baseline agents vs LLM" width="820">
276
+ </p>
277
 
278
+ The chart compares three heuristic baselines (`random`, `sequential`, `smart`) against an LLM agent (Llama 3.3 70B via the HuggingFace Inference Router) across all three tasks. The score is the same `[0, 1]` normalization used by `inference.py`: `0.7 · bug_ratio + 0.3 · coverage_ratio`.
 
 
 
 
 
 
279
 
280
+ | Agent | basic_validation | edge_cases | security_workflows | **Average** |
281
+ |---|---|---|---|---|
282
+ | `random` (lower bound) | 0.35 | 0.31 | 0.31 | **0.323** |
283
+ | `sequential` (fixed plan) | 0.65 | 0.46 | 0.57 | **0.559** |
284
+ | `smart` (200-line heuristic) | **0.85** | 0.89 | **0.77** | **0.832** |
285
+ | `llm` Llama 3.3 70B | 0.85 | 0.65 | 0.58 | **0.667** |
 
 
 
 
 
 
 
 
 
 
 
 
 
286
 
287
+ **What the spread means**
 
 
288
 
289
+ - The **5x gap** between random (0.32) and smart (0.83) proves the reward function is dense enough to distinguish agent skill.
290
+ - The smart agent is a 200-line hand-coded heuristic that targets each of the 13 bugs by ID — it's the upper bound a human expert can hand-craft.
291
+ - Llama 3.3 70B beats sequential by a wide margin without seeing any task-specific code, showing the environment is *legible* to a general-purpose LLM.
292
+ - The gap between Llama (0.67) and smart (0.83) is the headroom a more capable agent is supposed to close.

+ The **environment is the dataset.** Each `reset(seed=N)` produces a unique database (different users, tasks, ownership), so agents can't memorize — they have to read the API spec and reason about what to attack.
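That determinism-with-variation property can be sketched with a toy generator. This is illustrative only — the field names and sizes below are invented, not the environment's actual schema:

```python
import random

def make_world(seed: int) -> dict:
    """Toy stand-in for seeded world generation: same seed, same world."""
    rng = random.Random(seed)
    users = [f"user_{rng.randint(1000, 9999)}" for _ in range(3)]
    # Task ownership reshuffles per seed, so memorized exploit paths don't transfer.
    tasks = {f"task_{i}": rng.choice(users) for i in range(5)}
    return {"users": users, "tasks": tasks}

assert make_world(9999) == make_world(9999)  # reproducible for evaluation
assert make_world(9999) != make_world(42)    # but unique per seed
```

The same pattern (one `random.Random(seed)` per reset) is what makes episodes reproducible for scoring while still preventing answer memorization across seeds.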
 
 
  ---

+ ## Project structure

  ```
  api_testing_env/
  ├── inference.py      # SUBMISSION ENTRY POINT — OpenAI client, [START]/[STEP]/[END]
  ├── models.py         # APITestAction, APITestObservation, APITestState
+ ├── client.py         # EnvClient subclass
  ├── openenv.yaml      # OpenEnv manifest
  ├── pyproject.toml    # Dependencies (incl. openai, gradio)
  ├── Dockerfile        # Container for HuggingFace Spaces
  ...
  │   ├── models.py     # Pydantic schemas
  │   └── routes/       # tasks.py, users.py, auth.py
+ ├── plots/            # Figures used in this README
+ │   ├── environment_architecture.png
+ │   ├── environment_state_machine.png
+ │   ├── reward_signal_function.png
+ │   └── baseline_comparison_matplotlib.png
+ ├── gradio_app.py     # Interactive UI dashboard (mounted at /ui/)
  └── data/tasks.json   # Task definitions + bug registry
  ```


  - [OWASP API Security Top 10 (2023)](https://owasp.org/API-Security/)
  - [APIRL: Deep RL for REST API Fuzzing (AAAI 2025)](https://arxiv.org/abs/2412.15991)
  - [ARAT-RL: Adaptive REST API Testing with RL (IEEE/ACM ASE 2023)](https://codingsoo.github.io/publication/2024-adaptive-rest-api-testing-rl)
  - [OpenEnv Framework](https://meta-pytorch.org/OpenEnv/index.html)
gradio_app.py CHANGED
@@ -612,7 +612,6 @@ html.dark .eleven {
  <div class="eleven-content">
  <h2>Why <em>bother.</em></h2>
  <p>Every team ships APIs and every API has bugs. The usual tools <span class="eleven-chip">Postman</span> <span class="eleven-chip">Schemathesis</span> <span class="eleven-chip">OWASP&nbsp;ZAP</span> either need humans writing tests by hand or fall back to brute-force fuzzing.</p>
- <p>Recent papers — <em>APIRL</em> at AAAI 2025, <em>ARAT-RL</em> at ASE 2023 — show RL beats both. But there hasn't been a standard RL benchmark for it.</p>
  <div class="eleven-quote">This environment <em>is the benchmark.</em></div>
  <p>The agent doesn't get a written test plan. It reads the API spec, plans a campaign, runs it, and reports what broke. The reward function is verifiable — no LLM judge, no soft heuristics — and every signal maps to a real OWASP category, so episodes can be scored deterministically.</p>
  </div>
@@ -1633,6 +1632,25 @@ def build_ui():
  gr.Markdown("*Auto-generated OWASP security report. Populates as bugs are found.*")
  bug_report_display = gr.Markdown("No bugs found yet. Send requests to discover vulnerabilities.")

+ # ── Demo video (embedded between the app and the blog) ──
+ gr.HTML(
+ """
+ <div style="max-width: 900px; margin: 32px auto; padding: 0 16px;">
+ <div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden; border-radius: 12px; box-shadow: 0 8px 32px rgba(0,0,0,0.4);">
+ <iframe
+ src="https://www.youtube.com/embed/9psbwJug6G4"
+ title="YouTube video player"
+ frameborder="0"
+ allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
+ referrerpolicy="strict-origin-when-cross-origin"
+ allowfullscreen
+ style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;">
+ </iframe>
+ </div>
+ </div>
+ """
+ )
+
  # ── Editorial blog-style documentation below the app ──
  gr.HTML(BLOG_HTML)

plots/baseline_comparison_matplotlib.png ADDED
plots/baseline_comparison_matplotlib.svg ADDED
plots/baseline_comparison_plotly.png ADDED

Git LFS Details

  • SHA256: 7fb2554905489f7c06cb2f77a287105c9608d996ff0e0413199fad3b58ed599a
  • Pointer size: 131 Bytes
  • Size of remote file: 192 kB
plots/baseline_comparison_plotly.svg ADDED
plots/environment_architecture.png ADDED

Git LFS Details

  • SHA256: c774017887f92eb173737d928c87d94af2f3fe82f659d60a16879bcab3c0f97b
  • Pointer size: 131 Bytes
  • Size of remote file: 164 kB
plots/environment_state_machine.png ADDED

Git LFS Details

  • SHA256: deba998ee30a8edd50a904ed4acdb1fdf034d3deb4e1227f421f63b0e5c23bb4
  • Pointer size: 131 Bytes
  • Size of remote file: 121 kB
plots/episode_lifecycle.svg ADDED
plots/inference_results_matplotlib.png ADDED
plots/inference_results_matplotlib.svg ADDED
plots/inference_results_plotly.png ADDED

Git LFS Details

  • SHA256: b7b3dfafce85312c65aa357e035d31cd06d43014b9dbced05c9ae99f216caa6c
  • Pointer size: 131 Bytes
  • Size of remote file: 178 kB
plots/inference_results_plotly.svg ADDED
plots/plot_inference_results.py ADDED
@@ -0,0 +1,350 @@
+ """Visualize inference.py task scores and per-step rewards.
+
+ Generates matplotlib and plotly bar charts (PNG + SVG) under plots/.
+
+ Two figures are produced:
+ 1. inference_results_*   — LLM-only view: per-task final score + per-step rewards
+ 2. baseline_comparison_* — LLM vs random / sequential / smart baselines
+
+ LLM data is the inference.py run on 2026-04-08 against
+ meta-llama/Llama-3.3-70B-Instruct via the HF router. Baseline numbers come
+ from `python baseline.py --agent all --task all --seed 42` and are converted
+ to the same normalized score the LLM reports:
+     score = 0.7 * (bugs_found / total_bugs) + 0.3 * (coverage_pct / 100)
+ """
+
+ from __future__ import annotations
+
+ from pathlib import Path
+
+ import matplotlib.pyplot as plt
+ import plotly.graph_objects as go
+ from plotly.subplots import make_subplots
+
+ OUT_DIR = Path(__file__).parent
+ OUT_DIR.mkdir(parents=True, exist_ok=True)
+
+ TASKS = ["basic_validation", "edge_cases", "security_workflows"]
+ SCORES = [0.647, 0.772, 0.581]
+ STEPS = [18, 27, 29]
+ AVG_SCORE = 0.667
+
+ # --- Baseline rollout results (seed=42) ---
+ # Each entry: (bugs_found, total_bugs, coverage_pct, steps)
+ BASELINE_RAW = {
+     "random": {
+         "basic_validation": (1, 3, 40.0, 25),
+         "edge_cases": (2, 9, 50.0, 35),
+         "security_workflows": (3, 13, 50.0, 45),
+     },
+     "sequential": {
+         "basic_validation": (3, 3, 50.0, 25),
+         "edge_cases": (4, 9, 50.0, 35),
+         "security_workflows": (4, 13, 50.0, 45),
+     },
+     "smart": {
+         "basic_validation": (3, 3, 50.0, 25),
+         "edge_cases": (9, 9, 50.0, 35),
+         "security_workflows": (12, 13, 50.0, 45),
+     },
+ }
+
+
+ def normalized_score(bugs_found: int, total_bugs: int, coverage_pct: float) -> float:
+     """Same formula as inference.compute_task_score — keeps everything in [0, 1]."""
+     bug_ratio = (bugs_found / total_bugs) if total_bugs > 0 else 0.0
+     cov_ratio = max(0.0, min(1.0, coverage_pct / 100.0))
+     return max(0.0, min(1.0, 0.70 * bug_ratio + 0.30 * cov_ratio))
+
+
+ # Pre-compute normalized scores for each baseline + LLM
+ AGENT_LABELS = ["random", "sequential", "smart", "llm (Llama-3.3-70B)"]
+ LLM_SCORES_BY_TASK = dict(zip(TASKS, SCORES))
+
+ AGENT_SCORES: dict[str, list[float]] = {}
+ for agent_name, per_task in BASELINE_RAW.items():
+     AGENT_SCORES[agent_name] = [
+         normalized_score(*per_task[t][:3]) for t in TASKS
+     ]
+ AGENT_SCORES["llm (Llama-3.3-70B)"] = [LLM_SCORES_BY_TASK[t] for t in TASKS]
+
+ AGENT_AVG = {a: sum(s) / len(s) for a, s in AGENT_SCORES.items()}
+
+ AGENT_COLORS = {
+     "random": "#9E9E9E",
+     "sequential": "#F4A261",
+     "smart": "#2A9D8F",
+     "llm (Llama-3.3-70B)": "#6A4C93",
+ }
+
+ PER_STEP_REWARDS = {
+     "basic_validation": [
+         0.33, 0.23, 0.28, 0.18, 0.13, 0.28, 0.25, 0.28, 0.28,
+         0.18, 0.23, 0.33, 0.13, 0.03, 0.03, 0.13, -0.05, 0.03,
+     ],
+     "edge_cases": [
+         0.33, 0.28, 0.28, 0.08, 0.18, 0.25, 0.48, 0.28, 0.33,
+         0.08, 0.33, 0.03, 0.23, 0.33, 0.28, 0.18, 0.03, 0.08,
+         0.08, 0.13, 0.13, 0.08, 0.13, 0.00, 0.33, 0.08, 0.00,
+     ],
+     "security_workflows": [
+         0.33, 0.28, 0.28, 0.08, 0.03, 0.18, 0.48, 0.23, 0.28,
+         0.25, 0.33, 0.33, 0.23, 0.33, 0.28, 0.08, 0.18, 0.03,
+         0.13, 0.13, 0.13, 0.08, 0.00, 0.13, 0.00, -0.05, -0.05,
+         0.03, -0.05,
+     ],
+ }
+
+ COLORS = {
+     "basic_validation": "#4C72B0",
+     "edge_cases": "#55A868",
+     "security_workflows": "#C44E52",
+ }
+
+
+ # ---------- matplotlib ----------
+ def plot_matplotlib() -> None:
+     fig, axes = plt.subplots(1, 2, figsize=(13, 5.2))
+
+     # 1. Final scores per task
+     ax = axes[0]
+     bar_colors = [COLORS[t] for t in TASKS]
+     bars = ax.bar(TASKS, SCORES, color=bar_colors, edgecolor="black", linewidth=0.6)
+     ax.axhline(AVG_SCORE, color="#333", linestyle="--", linewidth=1.2,
+                label=f"avg = {AVG_SCORE:.3f}")
+     ax.set_ylim(0, 1.0)
+     ax.set_ylabel("Final score")
+     ax.set_title("Inference final score by task")
+     ax.legend(loc="upper right", frameon=False)
+     for bar, score, steps in zip(bars, SCORES, STEPS):
+         ax.text(
+             bar.get_x() + bar.get_width() / 2,
+             bar.get_height() + 0.015,
+             f"{score:.3f}\n({steps} steps)",
+             ha="center", va="bottom", fontsize=9,
+         )
+     ax.tick_params(axis="x", rotation=15)
+
+     # 2. Per-step rewards (grouped over step index)
+     ax = axes[1]
+     max_len = max(len(v) for v in PER_STEP_REWARDS.values())
+     width = 0.27
+     x_base = list(range(1, max_len + 1))
+     for i, task in enumerate(TASKS):
+         rewards = PER_STEP_REWARDS[task]
+         xs = [x + (i - 1) * width for x in range(1, len(rewards) + 1)]
+         ax.bar(xs, rewards, width=width, color=COLORS[task],
+                label=task, edgecolor="black", linewidth=0.3)
+     ax.axhline(0, color="#666", linewidth=0.8)
+     ax.set_xlabel("Step")
+     ax.set_ylabel("Reward")
+     ax.set_title("Per-step reward by task")
+     ax.set_xticks(x_base[::2])
+     ax.legend(frameon=False, fontsize=9)
+
+     fig.suptitle(
+         "inference.py — meta-llama/Llama-3.3-70B-Instruct (avg score 0.667)",
+         fontsize=12, fontweight="bold",
+     )
+     fig.tight_layout(rect=(0, 0, 1, 0.96))
+
+     png_path = OUT_DIR / "inference_results_matplotlib.png"
+     svg_path = OUT_DIR / "inference_results_matplotlib.svg"
+     fig.savefig(png_path, dpi=160, bbox_inches="tight")
+     fig.savefig(svg_path, bbox_inches="tight")
+     plt.close(fig)
+     print(f"[matplotlib] wrote {png_path}")
+     print(f"[matplotlib] wrote {svg_path}")
+
+
+ # ---------- plotly ----------
+ def plot_plotly() -> None:
+     fig = make_subplots(
+         rows=1, cols=2,
+         column_widths=[0.4, 0.6],
+         subplot_titles=("Final score by task", "Per-step reward by task"),
+     )
+
+     # 1. Final scores
+     fig.add_trace(
+         go.Bar(
+             x=TASKS,
+             y=SCORES,
+             marker_color=[COLORS[t] for t in TASKS],
+             text=[f"{s:.3f}<br>({n} steps)" for s, n in zip(SCORES, STEPS)],
+             textposition="outside",
+             name="Final score",
+             showlegend=False,
+         ),
+         row=1, col=1,
+     )
+     fig.add_hline(
+         y=AVG_SCORE, line_dash="dash", line_color="#333",
+         annotation_text=f"avg = {AVG_SCORE:.3f}",
+         annotation_position="top left",
+         row=1, col=1,
+     )
+
+     # 2. Per-step rewards (grouped bars)
+     for task in TASKS:
+         rewards = PER_STEP_REWARDS[task]
+         fig.add_trace(
+             go.Bar(
+                 x=list(range(1, len(rewards) + 1)),
+                 y=rewards,
+                 name=task,
+                 marker_color=COLORS[task],
+             ),
+             row=1, col=2,
+         )
+
+     fig.update_yaxes(title_text="Final score", range=[0, 1.0], row=1, col=1)
+     fig.update_yaxes(title_text="Reward", row=1, col=2)
+     fig.update_xaxes(title_text="Step", row=1, col=2)
+     fig.update_layout(
+         title=dict(
+             text="inference.py — meta-llama/Llama-3.3-70B-Instruct (avg score 0.667)",
+             x=0.5, xanchor="center",
+         ),
+         barmode="group",
+         bargap=0.2,
+         template="plotly_white",
+         width=1300,
+         height=560,
+         legend=dict(orientation="h", y=-0.18, x=0.5, xanchor="center"),
+         margin=dict(t=80, b=80, l=60, r=30),
+     )
+
+     png_path = OUT_DIR / "inference_results_plotly.png"
+     svg_path = OUT_DIR / "inference_results_plotly.svg"
+     fig.write_image(png_path, scale=2)
+     fig.write_image(svg_path)
+     print(f"[plotly] wrote {png_path}")
+     print(f"[plotly] wrote {svg_path}")
+
+
+ # ---------- baseline comparison: matplotlib ----------
+ def plot_baselines_matplotlib() -> None:
+     fig, axes = plt.subplots(1, 2, figsize=(13.5, 5.4))
+
+     # 1. Grouped bars per task
+     ax = axes[0]
+     n_agents = len(AGENT_LABELS)
+     width = 0.2
+     x = list(range(len(TASKS)))
+     for i, agent in enumerate(AGENT_LABELS):
+         offset = (i - (n_agents - 1) / 2) * width
+         xs = [xi + offset for xi in x]
+         bars = ax.bar(
+             xs, AGENT_SCORES[agent], width=width,
+             color=AGENT_COLORS[agent], label=agent,
+             edgecolor="black", linewidth=0.4,
+         )
+         for bar, val in zip(bars, AGENT_SCORES[agent]):
+             ax.text(
+                 bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.012,
+                 f"{val:.2f}", ha="center", va="bottom", fontsize=7.5,
+             )
+     ax.set_xticks(x)
+     ax.set_xticklabels(TASKS, rotation=10)
+     ax.set_ylim(0, 1.0)
+     ax.set_ylabel("Normalized score")
+     ax.set_title("Per-task score: baselines vs LLM")
+     ax.legend(frameon=False, fontsize=8.5, loc="upper right")
+
+     # 2. Average score across all 3 tasks
+     ax = axes[1]
+     avgs = [AGENT_AVG[a] for a in AGENT_LABELS]
+     colors = [AGENT_COLORS[a] for a in AGENT_LABELS]
+     bars = ax.bar(AGENT_LABELS, avgs, color=colors, edgecolor="black", linewidth=0.6)
+     for bar, val in zip(bars, avgs):
+         ax.text(
+             bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.012,
+             f"{val:.3f}", ha="center", va="bottom", fontsize=10, fontweight="bold",
+         )
+     ax.set_ylim(0, 1.0)
+     ax.set_ylabel("Mean score (3 tasks)")
+     ax.set_title("Average score across all tasks")
+     ax.tick_params(axis="x", rotation=12)
+
+     fig.suptitle(
+         "Baseline agents vs LLM — score = 0.7·bug_ratio + 0.3·coverage_ratio",
+         fontsize=12, fontweight="bold",
+     )
+     fig.tight_layout(rect=(0, 0, 1, 0.95))
+
+     png_path = OUT_DIR / "baseline_comparison_matplotlib.png"
+     svg_path = OUT_DIR / "baseline_comparison_matplotlib.svg"
+     fig.savefig(png_path, dpi=160, bbox_inches="tight")
+     fig.savefig(svg_path, bbox_inches="tight")
+     plt.close(fig)
+     print(f"[matplotlib] wrote {png_path}")
+     print(f"[matplotlib] wrote {svg_path}")
+
+
+ # ---------- baseline comparison: plotly ----------
+ def plot_baselines_plotly() -> None:
+     fig = make_subplots(
+         rows=1, cols=2,
+         column_widths=[0.62, 0.38],
+         subplot_titles=("Per-task score: baselines vs LLM", "Average score across all tasks"),
+     )
+
+     # 1. Grouped bars per task
+     for agent in AGENT_LABELS:
+         fig.add_trace(
+             go.Bar(
+                 x=TASKS,
+                 y=AGENT_SCORES[agent],
+                 name=agent,
+                 marker_color=AGENT_COLORS[agent],
+                 text=[f"{v:.2f}" for v in AGENT_SCORES[agent]],
+                 textposition="outside",
+                 legendgroup=agent,
+             ),
+             row=1, col=1,
+         )
+
+     # 2. Average score
+     avgs = [AGENT_AVG[a] for a in AGENT_LABELS]
+     fig.add_trace(
+         go.Bar(
+             x=AGENT_LABELS,
+             y=avgs,
+             marker_color=[AGENT_COLORS[a] for a in AGENT_LABELS],
+             text=[f"{v:.3f}" for v in avgs],
+             textposition="outside",
+             showlegend=False,
+         ),
+         row=1, col=2,
+     )
+
+     fig.update_yaxes(title_text="Normalized score", range=[0, 1.05], row=1, col=1)
+     fig.update_yaxes(title_text="Mean score (3 tasks)", range=[0, 1.05], row=1, col=2)
+     fig.update_layout(
+         title=dict(
+             text="Baseline agents vs LLM — score = 0.7·bug_ratio + 0.3·coverage_ratio",
+             x=0.5, xanchor="center",
+         ),
+         barmode="group",
+         bargap=0.18,
+         template="plotly_white",
+         width=1400,
+         height=580,
+         legend=dict(orientation="h", y=-0.18, x=0.5, xanchor="center"),
+         margin=dict(t=80, b=90, l=60, r=30),
+     )
+
+     png_path = OUT_DIR / "baseline_comparison_plotly.png"
+     svg_path = OUT_DIR / "baseline_comparison_plotly.svg"
+     fig.write_image(png_path, scale=2)
+     fig.write_image(svg_path)
+     print(f"[plotly] wrote {png_path}")
+     print(f"[plotly] wrote {svg_path}")
+
+
+ if __name__ == "__main__":
+     plot_matplotlib()
+     plot_plotly()
+     plot_baselines_matplotlib()
+     plot_baselines_plotly()
plots/reward_signal_function.png ADDED

Git LFS Details

  • SHA256: 27a8b937fb2d4aee6af33403e73b0aa282e994874e5f1f2b67648719e8d5b84b
  • Pointer size: 131 Bytes
  • Size of remote file: 129 kB