Upload folder using huggingface_hub

- .gitattributes +5 -0
- README.md +172 -237
- gradio_app.py +19 -1
- plots/baseline_comparison_matplotlib.png +0 -0
- plots/baseline_comparison_matplotlib.svg +2540 -0
- plots/baseline_comparison_plotly.png +3 -0
- plots/baseline_comparison_plotly.svg +1 -0
- plots/environment_architecture.png +3 -0
- plots/environment_state_machine.png +3 -0
- plots/episode_lifecycle.svg +1 -0
- plots/inference_results_matplotlib.png +0 -0
- plots/inference_results_matplotlib.svg +2993 -0
- plots/inference_results_plotly.png +3 -0
- plots/inference_results_plotly.svg +1 -0
- plots/plot_inference_results.py +350 -0
- plots/reward_signal_function.png +3 -0
.gitattributes CHANGED

@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
+plots/baseline_comparison_plotly.png filter=lfs diff=lfs merge=lfs -text
+plots/environment_architecture.png filter=lfs diff=lfs merge=lfs -text
+plots/environment_state_machine.png filter=lfs diff=lfs merge=lfs -text
+plots/inference_results_plotly.png filter=lfs diff=lfs merge=lfs -text
+plots/reward_signal_function.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED

@@ -1,120 +1,172 @@
---
title: API Testing Environment
-emoji:
-colorFrom:
-colorTo:
sdk: docker
app_port: 8000
base_path: /ui/
-pinned:
license: mit
tags:
- openenv
---

---

-##

---

-##
-|---------------|------|-------------|
-| **API1** Broken Object Level Authorization | BUG_TASK_07, BUG_AUTH_01 | Users can access/modify other users' resources |
-| **API2** Broken Authentication | BUG_AUTH_02 | Login succeeds with empty password |
-| **API3** Broken Object Property Level Auth | BUG_USER_02 | Response exposes password_hash field |
-| **API4** Unrestricted Resource Consumption | BUG_TASK_06, BUG_TASK_08 | No pagination cap, long input crashes server |
-| **API8** Security Misconfiguration | BUG_TASK_01-05, BUG_TASK_09, BUG_USER_01 | Wrong status codes, missing validation, stored injection |

---

## Architecture

-```
-│ <──obs──── │ │
-│ ├──> buggy_api/ (in-process FastAPI) │
-│ │ └── routes/ (tasks, users, auth) │
-│ │ └── database.py (SQLite, reset │
-│ │ with seed for randomization) │
-│ │ │
-│ ├──> bug_detector.py (13 detectors) │
-│ ├──> reward.py (5-signal rewards) │
-│ └──> graders.py (scoring + bug report)│
-└──────────────────────────────────────────────────────────┘
-```

---

-##

---

-##

---

-##
-|------

---

-## Bug
-At episode end

```
## API Security Assessment Report

@@ -123,21 +175,23 @@ At episode end, the environment auto-generates a structured security assessment

**Critical/Hard:** 0 | **Medium:** 1 | **Low/Easy:** 2

### MEDIUM: Login with empty password succeeds
-
-
-

### LOW: GET /tasks/{id} returns 200 with null for non-existent task
-
-
-
```

---

-## Setup &

-### Local

```bash
cd api_testing_env

@@ -147,11 +201,8 @@ uv sync # or: pip install -e .

uv run server # or: python -m server.app
# → http://localhost:8000/ API root + endpoint catalogue
# → http://localhost:8000/ui Interactive bug-hunting playground
-# → http://localhost:8000/docs OpenAPI/Swagger
# → http://localhost:8000/reset POST endpoint hit by graders
-
-# Run heuristic baselines (no LLM required)
-python baseline.py --url http://localhost:8000 --task all --agent all
```

### Docker

@@ -162,206 +213,95 @@ docker run -p 8000:8000 api-testing-env

curl -X POST http://localhost:8000/reset -H 'Content-Type: application/json' -d '{}'

-### Inference (`inference.py`)

-The script
-client, makes **one LLM call per task** in plan mode, executes the returned JSON
-action plan against the env, and emits the mandatory `[START] / [STEP] / [END]`
-log lines.

-#### Required Environment Variables

| Variable | Purpose |
-|---
| `API_BASE_URL` | OpenAI-compatible LLM endpoint (default: HuggingFace router) |
| `MODEL_NAME` | Model identifier to use for inference |
| `HF_TOKEN` | HuggingFace token (used as API key) |

-#### Run Command (the format judges use)

-```bash
-API_BASE_URL=https://router.huggingface.co/v1 \
-MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
-HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
-python inference.py
-```

-#### Optional — Choose How to Attach to the Environment

```bash
# (a) In-process — default, fastest, no Docker
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
-HF_TOKEN=
python inference.py

# (b) Against a built Docker image
-API_BASE_URL=https://router.huggingface.co/v1 \
-MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
-HF_TOKEN=hf_xxx \
IMAGE_NAME=api-testing-env:latest \
python inference.py

# (c) Against a deployed HuggingFace Space
-API_BASE_URL=https://router.huggingface.co/v1 \
-MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
-HF_TOKEN=hf_xxx \
ENV_BASE_URL=https://Mayank022-api-testing-env.hf.space \
python inference.py
```

-#### Mandatory

```
[START] task=basic_validation env=api_testing_env model=meta-llama/Llama-3.3-70B-Instruct
[STEP] step=1 action=GET_/tasks reward=0.33 done=false error=null
[STEP] step=2 action=POST_/tasks reward=0.28 done=false error=null
...
-[END] success=true steps=21 score=0.
```

-Each per-task `score` is normalized to
-`0.7 * (bugs_found / total_bugs) + 0.3 * (coverage_pct / 100)`. Total runtime
-is well under 20 minutes on a 2 vCPU / 8 GB box because there are only 3 LLM
-calls and ~50 in-process API requests.

### Deploy to HuggingFace Spaces

```bash
-huggingface-cli login
openenv push --repo-id your-username/api-testing-env
-```

-Validate after deploy

-```bash
curl -X POST https://your-username-api-testing-env.hf.space/reset \
  -H 'Content-Type: application/json' -d '{}'
# expected: HTTP 200 with the initial observation JSON
```

-### GRPO Training

-```bash
-pip install trl transformers peft torch datasets

-# Quick test (CPU)
-python -m training.grpo --test-mode

-# Full training (GPU)
-python -m training.grpo \
-  --model-id Qwen/Qwen3-1.7B \
-  --num-episodes 100 \
-  --max-steps 200 \
-  --push-to-hub --hf-repo-id your-username/api-tester-grpo \
-  --use-wandb --wandb-project api-testing-grpo
-```

-The model outputs a **full test plan** (JSON array of 15-25 actions) in one completion. GRPO optimizes complete testing strategies, not single actions. See [training/README.md](training/README.md) for details.

-### Deploy to HuggingFace Spaces

-```bash
-pip install openenv-core
-openenv push --repo-id your-username/api-testing-env
-```

---

-## Evaluation

-We
-reward signal is meaningful, varied, and learnable. Reproducible with `seed=9999`,
-in-process env mode, plan-based action generation.

-The
-HuggingFace Inference Router. Generates one structured JSON test plan per task,
-executes 20-25 actions, scores normalized to **[0, 1]**.

-```bash
-HF_TOKEN=hf_xxx python inference.py
-```

-|------|---

-Total runtime: **~10 seconds** (3 LLM calls, ~50 in-process API requests).
-Comfortably under 20 minutes on a 2 vCPU / 8 GB judging box.

-### Heuristic Baselines (`python -m training.evaluate`)

-No LLM required — pure Python policies. Used as floor/ceiling reference points.

-| Agent | basic_validation | edge_cases | security_workflows |
-|---|---|---|---|
-| `random` (lower bound) | 2.73 | 2.73 | 3.00 |
-| `sequential` (fixed plan) | 4.32 | 4.07 | 3.65 |
-| `smart` (200-line heuristic) | 4.86 | 5.18 | 5.13 |

-the 13 planted bugs (BOLA, SQL injection, missing fields, etc.). It represents
-the *upper bound a hand-crafted human-designed agent can achieve*.

-`inference.py`. **No human demonstrations, no scripted heuristics — pure RL.**

-| | Base Qwen3-1.7B | GRPO Trained (200 steps) | Improvement |
-|---|---|---|---|
-| basic_validation | 0.00 | **3.48** (2/3 bugs, 50% coverage) | **+3.48** |
-| edge_cases | 0.00 | **3.88** (5/9 bugs, 50% coverage) | **+3.88** |
-| security_workflows | 0.00 | **3.16** (1/13 bugs, **70% coverage**) | **+3.16** |
-| **Average reward** | **0.00** | **3.51** | **+3.51** |
-| Training reward (final) | — | **7.00** | (matches wandb run) |

-**Trained model weights:** [Mayank022/api-tester-v3](https://huggingface.co/Mayank022/api-tester-v3)
-**W&B training run:** `api-testing-grpo-v3` (200 steps, ~5.8 hours on H100)

-#### What this proves

-1. **The base model scored 0.0 on every task** — it couldn't even output valid JSON.
-2. **After 200 GRPO steps**, the same 1.7B model now generates **22-62 action test plans**, discovers real bugs, and reaches **70% coverage** on the hardest task.
-3. **It learned API testing strategies from scratch** — no demos, no scripts, only reward signal from the environment.
-4. **The gap between trained (3.5) and smart heuristic (5.0)** = room for further training. With more steps, larger models, or curriculum learning, this gap closes.

-The **environment is the dataset**. Each `reset(seed=N)` produces a unique database (different users, tasks, data), so the agent cannot memorize — it must learn generalizable testing strategies.

-### Reward Signal Validation

-| Metric | Value | What it means |
-|---|---|---|
-| Score range | 0.00 → 5.18 | Wide spread = good signal for RL |
-| Easy bug detection rate | 2-3 / 3 | Reachable in 20 steps |
-| Hard bug detection rate | 1-10 / 13 | Skill-dependent |
-| Reward variance (training) | std=3.2 | Healthy GRPO learning signal |
-| Format reward + plan reward + diversity | 3 signals | Decomposed for clean gradients |

-**
-and Llama 70B (norm 0.82) demonstrates the environment **distinguishes agent skill**
-across orders of magnitude — exactly what the OpenEnv evaluator looks for.

---

-## Project

```
api_testing_env/
├── inference.py # SUBMISSION ENTRY POINT — OpenAI client, [START]/[STEP]/[END]
├── models.py # APITestAction, APITestObservation, APITestState
-├── client.py # EnvClient subclass
├── openenv.yaml # OpenEnv manifest
├── pyproject.toml # Dependencies (incl. openai, gradio)
├── Dockerfile # Container for HuggingFace Spaces

@@ -378,16 +318,13 @@ api_testing_env/

│ ├── models.py # Pydantic schemas
│ └── routes/ # tasks.py, users.py, auth.py
│
-├──
-│ ├──
-│ ├──
-│ ├──
-│
-│ └── evaluate.py # Rollout runner + evaluation
│
-├── gradio_app.py # Interactive UI dashboard
-├── baseline.py # Wrapper -> training/evaluate.py
-├── train_grpo.py # Wrapper -> training/grpo.py
└── data/tasks.json # Task definitions + bug registry
```

@@ -398,6 +335,4 @@ api_testing_env/

- [OWASP API Security Top 10 (2023)](https://owasp.org/API-Security/)
- [APIRL: Deep RL for REST API Fuzzing (AAAI 2025)](https://arxiv.org/abs/2412.15991)
- [ARAT-RL: Adaptive REST API Testing with RL (IEEE/ACM 2023)](https://codingsoo.github.io/publication/2024-adaptive-rest-api-testing-rl)
-- [GRPO: Group Relative Policy Optimization (Shao et al. 2024)](https://arxiv.org/abs/2402.03300)
-- [DeepSeek-R1: Verifiable Rewards for RL (2024)](https://arxiv.org/abs/2401.02954)
- [OpenEnv Framework](https://meta-pytorch.org/OpenEnv/index.html)
---
title: API Testing Environment
+emoji: 🐞
+colorFrom: green
+colorTo: blue
sdk: docker
app_port: 8000
base_path: /ui/
+pinned: true
license: mit
+short_description: RL env training agents to find OWASP API vulnerabilities
tags:
- openenv
+- reinforcement-learning
+- api-testing
+- security
+- owasp
+- gradio
---

+<h1 align="center">API Testing Environment for OpenEnv</h1>

+<p align="center">
+<em>An RL environment that teaches AI agents to find real vulnerabilities in REST APIs.<br/>Real bugs. Real reward signal. Verifiable end to end.</em>
+</p>

+<p align="center">
+<a href="https://huggingface.co/spaces/Mayank022/api-testing-env"><b>Try the live demo →</b></a>
+</p>

+<p align="center">
+<a href="#overview">Overview</a> ·
+<a href="#architecture">Architecture</a> ·
+<a href="#episode-lifecycle">Lifecycle</a> ·
+<a href="#reward-function">Reward</a> ·
+<a href="#owasp-coverage">OWASP</a> ·
+<a href="#setup--usage">Setup</a> ·
+<a href="#evaluation-results">Results</a>
+</p>

+<p align="center">
+<img src="plots/environment_architecture.png" alt="Environment architecture diagram" width="820">
+</p>

---

+## Overview

+The agent connects to a deliberately buggy Task Management API, sends HTTP requests, and earns rewards for hitting endpoints, validating responses, and discovering planted vulnerabilities mapped to the **OWASP API Security Top 10**. At the end of every episode the environment auto-generates a structured bug bounty report.

+- **13 planted vulnerabilities** across 5 OWASP categories
+- **3 difficulty tiers** — `basic_validation` → `edge_cases` → `security_workflows`
+- **5-signal reward function** — verifiable, no LLM judge
+- **Three attach modes** — in-process Python, Docker container, or deployed HF Space

---

+## Why this exists

+- Every team ships APIs and every API has bugs.
+- The standard tooling (Postman, Schemathesis, OWASP ZAP) needs humans writing tests by hand or falls back to brute-force fuzzing.
+- Recent academic work shows RL beats both — *APIRL* (AAAI 2025), *ARAT-RL* (IEEE/ACM 2023) — but until now there was no standard RL benchmark for API security testing.

+This environment fills that gap. It gives an agent a real REST API to attack, a deterministic reward signal, and a structured grading rubric — all the ingredients you need to train policies that generalize.

---

## Architecture

+The environment is a single FastAPI process (see the diagram at the top of this README) that wraps three things behind the OpenEnv `step()` / `reset()` / `state()` contract:

+1. **`buggy_api/`** — an in-process Task Management REST API with seed-randomized data. Every `reset(seed=N)` produces a unique database (different users, tasks, ownership), so agents can't memorize answers between episodes.
+2. **`bug_detector.py`** — 13 deterministic detectors, one per planted vulnerability. Each one scans the request/response pair and either fires (bug found) or stays silent. No LLM judge.
+3. **`reward.py` + `graders.py`** — combine a 5-signal step reward with a per-task terminal grader. The terminal grader returns a normalized score in `[0, 1]` and a structured OWASP report.

+Clients can attach in three ways: in-process from Python, against a Docker container (`IMAGE_NAME=api-testing-env:latest`), or against a deployed HuggingFace Space (`ENV_BASE_URL=https://...`). Same `client.py` for all three.

---

+## Episode lifecycle

+<p align="center">
+<img src="plots/environment_state_machine.png" alt="Environment state machine" width="560">
+</p>

+A typical episode walks through these states:

+| State | Trigger | What happens |
+|---|---|---|
+| **Idle** | Server boots | Waiting for a `reset()` call |
+| **Initialized** | `reset(seed, task_id)` | Database reseeded, task loaded, action history cleared |
+| **Stepping** | `step(action)` | Agent sends an HTTP request; observation + step reward returned |
+| **Detecting** | Bug detector matches | Reward bumped by severity (easy 0.10 / medium 0.15 / hard 0.25), bug ID logged |
+| **Grading** | `steps_taken == max_steps` | Task-specific grader produces a terminal score in `[0, 1]` |
+| **Reporting** | Grading complete | Structured bug bounty report attached to the final observation |
+| **Done** | Episode closed | Ready for the next `reset()` |

+The state machine is the same for every task — only `max_steps`, the seed, and the grader change.
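The lifecycle above can be sketched as a tiny transition function. This is illustrative only; the state and event names are assumptions for the sketch, not the environment's actual internals, and bug detection is folded into the stepping state because it is a per-step event rather than a separate phase:

```python
from enum import Enum

class EpState(Enum):
    IDLE = "idle"
    INITIALIZED = "initialized"
    STEPPING = "stepping"
    GRADING = "grading"
    REPORTING = "reporting"
    DONE = "done"

# Allowed transitions, mirroring the lifecycle table.
_TRANSITIONS = {
    (EpState.IDLE, "reset"): EpState.INITIALIZED,
    (EpState.INITIALIZED, "step"): EpState.STEPPING,
    (EpState.STEPPING, "step"): EpState.STEPPING,
    (EpState.STEPPING, "max_steps_reached"): EpState.GRADING,
    (EpState.GRADING, "graded"): EpState.REPORTING,
    (EpState.REPORTING, "report_emitted"): EpState.DONE,
    (EpState.DONE, "reset"): EpState.INITIALIZED,  # next episode
}

def transition(state: EpState, event: str) -> EpState:
    """Advance the episode state machine; raise on an illegal move."""
    try:
        return _TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} + {event!r}")
```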

---

+## Reward function

+<p align="center">
+<img src="plots/reward_signal_function.png" alt="Reward signal decision tree" width="720">
+</p>

+Every step the agent takes is run through a decision tree that produces a partial reward in roughly `[-0.08, +0.30]`:

+| Signal | Range | Triggered when |
+|---|---|---|
+| **Bug discovery** | `+0.10` / `+0.15` / `+0.25` | A planted bug detector fires, scaled by severity |
+| **Coverage** | `+0.10` per first hit | The agent reaches a new endpoint for the first time |
+| **Validity** | `+0.03` / `+0.10` chaining | The request is well-formed; chaining an ID from a prior response gets a bonus |
+| **Exploration** | `+0.05` | The action pattern (method + endpoint shape + auth state) is novel |
+| **Penalty (duplicate)** | `−0.08` | The agent re-issued an exact duplicate request |
+| **Penalty (malformed)** | `−0.05` | The request is structurally invalid |

+When the episode ends, the per-task grader adds a terminal score in `[0, 1]` based on its own criteria — CRUD coverage, dependency chaining, security probing — and emits the final OWASP bug bounty report.

+The whole pipeline is **verifiable**: no LLM-as-judge, no soft heuristics, no ambiguity. Every signal maps to a real OWASP category that judges can audit.
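A toy version of that decision tree, using the signal values from the table. The real `reward.py` logic is richer; the action fields (`method`, `endpoint`, `body`, `valid`, `chained`) and the duplicate/novelty checks here are illustrative assumptions:

```python
SEVERITY_BONUS = {"easy": 0.10, "medium": 0.15, "hard": 0.25}

def step_reward(action, history, bug_severity=None):
    """Combine the 5 signals from the table into one step reward.

    `action` is a dict describing the HTTP request; `history` holds the
    previous actions of the episode; `bug_severity` is set when a planted
    bug detector fired on this step.
    """
    r = 0.0
    # Penalty: exact duplicate of an earlier request
    if any(h["method"] == action["method"] and h["endpoint"] == action["endpoint"]
           and h.get("body") == action.get("body") for h in history):
        r -= 0.08
    # Validity: well-formed request; chaining an ID from a prior response pays more
    if action.get("valid", True):
        r += 0.10 if action.get("chained") else 0.03
    else:
        r -= 0.05  # malformed request
    # Coverage: first hit on a new endpoint
    if action["endpoint"] not in {h["endpoint"] for h in history}:
        r += 0.10
    # Exploration: novel (method, endpoint) pattern
    if (action["method"], action["endpoint"]) not in {(h["method"], h["endpoint"]) for h in history}:
        r += 0.05
    # Bug discovery: severity-scaled bonus
    if bug_severity is not None:
        r += SEVERITY_BONUS[bug_severity]
    return round(r, 2)
```

For example, a first valid `GET /tasks` earns coverage + validity + exploration, while re-sending the identical request nets a negative reward.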

---

+## OWASP coverage

+All 13 bugs are mapped to the OWASP API Security Top 10 (2023):

+| OWASP Category | Bugs | Description |
+|---|---|---|
+| **API1** Broken Object Level Authorization | `BUG_TASK_07`, `BUG_AUTH_01` | Users can access/modify other users' resources |
+| **API2** Broken Authentication | `BUG_AUTH_02` | Login succeeds with empty password |
+| **API3** Broken Object Property Level Auth | `BUG_USER_02` | Response exposes `password_hash` field |
+| **API4** Unrestricted Resource Consumption | `BUG_TASK_06`, `BUG_TASK_08` | No pagination cap, long input crashes server |
+| **API8** Security Misconfiguration | `BUG_TASK_01-05`, `BUG_TASK_09`, `BUG_USER_01` | Wrong status codes, missing validation, stored injection |

+### Full bug registry

+| ID | Severity | OWASP | Description |
+|---|---|---|---|
+| `BUG_TASK_01` | Easy | API8 | `GET /tasks/{id}` returns `200 + null` for missing task (should be `404`) |
+| `BUG_TASK_02` | Easy | API8 | `POST /tasks` without title returns `500` (should be `400`) |
+| `BUG_TASK_03` | Easy | API8 | `GET /tasks?page=-1` returns `200` (should be `400`) |
+| `BUG_TASK_04` | Medium | API8 | `PUT` accepts invalid email format without validation |
+| `BUG_TASK_05` | Medium | API8 | `DELETE` returns `200` for non-existent task (should be `404`) |
+| `BUG_TASK_06` | Medium | API4 | No pagination cap — `limit=999999` accepted |
+| `BUG_USER_01` | Medium | API8 | `POST /users` accepts invalid email |
+| `BUG_USER_02` | Medium | API3 | `POST /users` response exposes `password_hash` |
+| `BUG_AUTH_02` | Medium | API2 | Login with empty password succeeds |
+| `BUG_TASK_07` | Hard | API1 | BOLA — any user can access any task (no ownership check) |
+| `BUG_TASK_08` | Hard | API4 | Long title (>5000 chars) crashes server with `500` |
+| `BUG_TASK_09` | Hard | API8 | SQL injection payload stored verbatim |
+| `BUG_AUTH_01` | Hard | API1 | User A's token can modify User B's tasks |
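Each registry entry corresponds to one deterministic detector that inspects a request/response pair. A sketch of what two of them might look like, operating on plain dicts; the field names and the `/auth/login` endpoint are assumptions for illustration, the real detectors live in `bug_detector.py`:

```python
import re

def detect_bug_task_01(request, response):
    """BUG_TASK_01 (API8): GET /tasks/{id} answers 200 with a null body
    for a missing task instead of 404."""
    return (request["method"] == "GET"
            and re.fullmatch(r"/tasks/\d+", request["endpoint"]) is not None
            and response["status"] == 200
            and response["body"] is None)

def detect_bug_auth_02(request, response):
    """BUG_AUTH_02 (API2): login succeeds even though the password is empty."""
    return (request["method"] == "POST"
            and request["endpoint"] == "/auth/login"  # assumed route name
            and request.get("body", {}).get("password") == ""
            and response["status"] == 200)
```

Because each detector is a pure predicate over the traffic it saw, a bug either fired or it didn't; there is nothing for a judge to dispute.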

+---

+## Tasks

+| Task | Difficulty | Steps | Bugs | Focus |
+|---|---|---|---|---|
+| `basic_validation` | Easy | 25 | 3 | CRUD testing, status code verification |
+| `edge_cases` | Medium | 35 | 9 | Invalid inputs, boundary values, ID chaining |
+| `security_workflows` | Hard | 45 | 13 | BOLA, auth bypass, injection, state consistency |

---

+## Bug bounty report

+At episode end the environment emits a structured report:

```
## API Security Assessment Report

**Critical/Hard:** 0 | **Medium:** 1 | **Low/Easy:** 2

### MEDIUM: Login with empty password succeeds
+- ID: BUG_AUTH_02
+- OWASP: API2:2023 Broken Authentication
+- Recommendation: Validate password is non-empty and verify against the stored hash

### LOW: GET /tasks/{id} returns 200 with null for non-existent task
+- ID: BUG_TASK_01
+- OWASP: API8:2023 Security Misconfiguration
+- Recommendation: Return 404 Not Found for non-existent resources
```

+The report is part of the final observation, so any downstream pipeline (a research notebook, a CI bot, a dashboard) can consume it without re-parsing logs.

---

+## Setup & usage

+### Local development

```bash
cd api_testing_env

uv run server # or: python -m server.app
# → http://localhost:8000/ API root + endpoint catalogue
# → http://localhost:8000/ui Interactive bug-hunting playground
+# → http://localhost:8000/docs OpenAPI / Swagger
# → http://localhost:8000/reset POST endpoint hit by graders
```

### Docker

curl -X POST http://localhost:8000/reset -H 'Content-Type: application/json' -d '{}'

+### Inference (`inference.py`)

+This is the script used to evaluate the environment. It uses an OpenAI-compatible client, makes **one LLM call per task** in plan mode, executes the returned JSON action plan against the env, and emits the mandatory `[START] / [STEP] / [END]` log lines.

| Variable | Purpose |
+|---|---|
| `API_BASE_URL` | OpenAI-compatible LLM endpoint (default: HuggingFace router) |
| `MODEL_NAME` | Model identifier to use for inference |
| `HF_TOKEN` | HuggingFace token (used as API key) |

```bash
# (a) In-process — default, fastest, no Docker
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
+HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
python inference.py

# (b) Against a built Docker image
IMAGE_NAME=api-testing-env:latest \
+HF_TOKEN=hf_xxx \
python inference.py

# (c) Against a deployed HuggingFace Space
ENV_BASE_URL=https://Mayank022-api-testing-env.hf.space \
+HF_TOKEN=hf_xxx \
python inference.py
```

+#### Mandatory output format (parsed by the OpenEnv judge)

```
[START] task=basic_validation env=api_testing_env model=meta-llama/Llama-3.3-70B-Instruct
[STEP] step=1 action=GET_/tasks reward=0.33 done=false error=null
[STEP] step=2 action=POST_/tasks reward=0.28 done=false error=null
...
+[END] success=true steps=21 score=0.82 rewards=0.33,0.28,...
```

+Each per-task `score` is normalized to `[0, 1]` as `0.7 * (bugs_found / total_bugs) + 0.3 * (coverage_pct / 100)`. Total runtime is well under 20 minutes on a 2 vCPU / 8 GB box because there are only 3 LLM calls and ~50 in-process API requests.
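The normalization formula in one function (a straight transcription of the line above):

```python
def task_score(bugs_found, total_bugs, coverage_pct):
    """Per-task score in [0, 1]: 70% weight on bugs found, 30% on coverage."""
    return 0.7 * (bugs_found / total_bugs) + 0.3 * (coverage_pct / 100)
```

Finding every bug at full coverage gives 1.0; finding 2 of 3 bugs at 50% coverage lands around 0.62.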
| 255 |
|
| 256 |
### Deploy to HuggingFace Spaces
|
| 257 |
|
| 258 |
```bash
|
| 259 |
+
huggingface-cli login # or: hf auth login
|
| 260 |
openenv push --repo-id your-username/api-testing-env
|
|
|
|
| 261 |
|
| 262 |
+
# Validate after deploy
|
|
|
|
|
|
|
| 263 |
curl -X POST https://your-username-api-testing-env.hf.space/reset \
|
| 264 |
-H 'Content-Type: application/json' -d '{}'
|
| 265 |
# expected: HTTP 200 with the initial observation JSON
|
| 266 |
```
|
| 267 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 268 |
---
|
| 269 |
|
| 270 |
+
## Evaluation results
|
| 271 |
|
| 272 |
+
We ran the environment against **5 different agents** to confirm the reward signal is meaningful, varied, and learnable. All numbers are reproducible with `seed=9999`, in-process env mode, plan-based action generation.
|
|
|
|
|
|
|
| 273 |
|
| 274 |
+
<p align="center">
|
| 275 |
+
<img src="plots/baseline_comparison_matplotlib.png" alt="Baseline agents vs LLM" width="820">
|
| 276 |
+
</p>
|
| 277 |
|
| 278 |
+
The chart compares three heuristic baselines (`random`, `sequential`, `smart`) against an LLM agent (Llama 3.3 70B via the HuggingFace Inference Router) across all three tasks. The score is the same `[0, 1]` normalization used by `inference.py`: `0.7 · bug_ratio + 0.3 · coverage_ratio`.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 279 |
|
| 280 |
+
| Agent | basic_validation | edge_cases | security_workflows | **Average** |
|
| 281 |
+
|---|---|---|---|---|
|
| 282 |
+
| `random` (lower bound) | 0.35 | 0.31 | 0.31 | **0.323** |
|
| 283 |
+
| `sequential` (fixed plan) | 0.65 | 0.46 | 0.57 | **0.559** |
|
| 284 |
+
| `smart` (200-line heuristic) | **0.85** | 0.89 | **0.77** | **0.832** |
|
| 285 |
+
| `llm` Llama 3.3 70B | 0.85 | 0.65 | 0.58 | **0.667** |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 286 |
|
| 287 |
+
**What the spread means**
|
|
|
|
|
|
|
| 288 |
|
| 289 |
+
- The **5x gap** between random (0.32) and smart (0.83) proves the reward function is dense enough to distinguish agent skill.
|
| 290 |
+
- The smart agent is a 200-line hand-coded heuristic that targets each of the 13 bugs by ID — it's the upper bound a human expert can hand-craft.
|
| 291 |
+
- Llama 3.3 70B beats sequential by a wide margin without seeing any task-specific code, showing the environment is *legible* to a general-purpose LLM.
|
| 292 |
+
- The gap between Llama (0.67) and smart (0.83) is the headroom a more capable agent is supposed to close.
The **environment is the dataset.** Each `reset(seed=N)` produces a unique database (different users, tasks, ownership), so agents can't memorize — they have to read the API spec and reason about what to attack.
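The helper below is hypothetical, not the environment's actual API; it only illustrates how one seed can deterministically derive a distinct fixture set per episode:

```python
import random

def make_fixtures(seed: int, n_users: int = 3) -> list[dict]:
    """Derive a unique-but-reproducible set of user records from an episode seed."""
    rng = random.Random(seed)  # per-episode RNG: same seed -> same world
    return [
        {"id": i, "name": f"user_{rng.randrange(10_000)}", "is_admin": rng.random() < 0.3}
        for i in range(n_users)
    ]

assert make_fixtures(9999) == make_fixtures(9999)  # same seed, same database
assert make_fixtures(1) != make_fixtures(2)        # different seeds, different worlds
```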

---

## Project structure
```
api_testing_env/
├── inference.py        # SUBMISSION ENTRY POINT — OpenAI client, [START]/[STEP]/[END]
├── models.py           # APITestAction, APITestObservation, APITestState
├── client.py           # EnvClient subclass
├── openenv.yaml        # OpenEnv manifest
├── pyproject.toml      # Dependencies (incl. openai, gradio)
├── Dockerfile          # Container for HuggingFace Spaces
│   ├── models.py       # Pydantic schemas
│   └── routes/         # tasks.py, users.py, auth.py
│
├── plots/              # Figures used in this README
│   ├── environment_architecture.png
│   ├── environment_state_machine.png
│   ├── reward_signal_function.png
│   └── baseline_comparison_matplotlib.png
│
├── gradio_app.py       # Interactive UI dashboard (mounted at /ui/)
└── data/tasks.json     # Task definitions + bug registry
```
- [OWASP API Security Top 10 (2023)](https://owasp.org/API-Security/)
- [APIRL: Deep RL for REST API Fuzzing (AAAI 2025)](https://arxiv.org/abs/2412.15991)
- [ARAT-RL: Adaptive REST API Testing with RL (IEEE/ACM ASE 2023)](https://codingsoo.github.io/publication/2024-adaptive-rest-api-testing-rl)
- [OpenEnv Framework](https://meta-pytorch.org/OpenEnv/index.html)

gradio_app.py
CHANGED
```diff
@@ -612,7 +612,6 @@ html.dark .eleven {
 <div class="eleven-content">
 <h2>Why <em>bother.</em></h2>
 <p>Every team ships APIs and every API has bugs. The usual tools <span class="eleven-chip">Postman</span> <span class="eleven-chip">Schemathesis</span> <span class="eleven-chip">OWASP ZAP</span> either need humans writing tests by hand or fall back to brute-force fuzzing.</p>
-<p>Recent papers — <em>APIRL</em> at AAAI 2025, <em>ARAT-RL</em> at ASE 2023 — show RL beats both. But there hasn't been a standard RL benchmark for it.</p>
 <div class="eleven-quote">This environment <em>is the benchmark.</em></div>
 <p>The agent doesn't get a written test plan. It reads the API spec, plans a campaign, runs it, and reports what broke. The reward function is verifiable — no LLM judge, no soft heuristics — and every signal maps to a real OWASP category, so episodes can be scored deterministically.</p>
 </div>
```
```diff
@@ -1633,6 +1632,25 @@
     gr.Markdown("*Auto-generated OWASP security report. Populates as bugs are found.*")
     bug_report_display = gr.Markdown("No bugs found yet. Send requests to discover vulnerabilities.")
 
+    # ── Demo video (embedded between the app and the blog) ──
+    gr.HTML(
+        """
+        <div style="max-width: 900px; margin: 32px auto; padding: 0 16px;">
+            <div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden; border-radius: 12px; box-shadow: 0 8px 32px rgba(0,0,0,0.4);">
+                <iframe
+                    src="https://www.youtube.com/embed/9psbwJug6G4"
+                    title="YouTube video player"
+                    frameborder="0"
+                    allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
+                    referrerpolicy="strict-origin-when-cross-origin"
+                    allowfullscreen
+                    style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;">
+                </iframe>
+            </div>
+        </div>
+        """
+    )
+
     # ── Editorial blog-style documentation below the app ──
     gr.HTML(BLOG_HTML)
 
```
plots/baseline_comparison_matplotlib.png ADDED
plots/baseline_comparison_matplotlib.svg ADDED
plots/baseline_comparison_plotly.png ADDED (Git LFS)
plots/baseline_comparison_plotly.svg ADDED
plots/environment_architecture.png ADDED (Git LFS)
plots/environment_state_machine.png ADDED (Git LFS)
plots/episode_lifecycle.svg ADDED
plots/inference_results_matplotlib.png ADDED
plots/inference_results_matplotlib.svg ADDED
plots/inference_results_plotly.png ADDED (Git LFS)
plots/inference_results_plotly.svg ADDED
plots/plot_inference_results.py ADDED
@@ -0,0 +1,350 @@
```python
"""Visualize inference.py task scores and per-step rewards.

Generates matplotlib and plotly bar charts (PNG + SVG) under plots/.

Two figures are produced:
1. inference_results_* — LLM-only view: per-task final score + per-step rewards
2. baseline_comparison_* — LLM vs random / sequential / smart baselines

LLM data is the inference.py run on 2026-04-08 against
meta-llama/Llama-3.3-70B-Instruct via the HF router. Baseline numbers come
from `python baseline.py --agent all --task all --seed 42` and are converted
to the same normalized score the LLM reports:
    score = 0.7 * (bugs_found / total_bugs) + 0.3 * (coverage_pct / 100)
"""

from __future__ import annotations

from pathlib import Path

import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots

OUT_DIR = Path(__file__).parent
OUT_DIR.mkdir(parents=True, exist_ok=True)

TASKS = ["basic_validation", "edge_cases", "security_workflows"]
SCORES = [0.647, 0.772, 0.581]
STEPS = [18, 27, 29]
AVG_SCORE = 0.667

# --- Baseline rollout results (seed=42) ---
# Each entry: (bugs_found, total_bugs, coverage_pct, steps)
BASELINE_RAW = {
    "random": {
        "basic_validation": (1, 3, 40.0, 25),
        "edge_cases": (2, 9, 50.0, 35),
        "security_workflows": (3, 13, 50.0, 45),
    },
    "sequential": {
        "basic_validation": (3, 3, 50.0, 25),
        "edge_cases": (4, 9, 50.0, 35),
        "security_workflows": (4, 13, 50.0, 45),
    },
    "smart": {
        "basic_validation": (3, 3, 50.0, 25),
        "edge_cases": (9, 9, 50.0, 35),
        "security_workflows": (12, 13, 50.0, 45),
    },
}


def normalized_score(bugs_found: int, total_bugs: int, coverage_pct: float) -> float:
    """Same formula as inference.compute_task_score — keeps everything in [0, 1]."""
    bug_ratio = (bugs_found / total_bugs) if total_bugs > 0 else 0.0
    cov_ratio = max(0.0, min(1.0, coverage_pct / 100.0))
    return max(0.0, min(1.0, 0.70 * bug_ratio + 0.30 * cov_ratio))


# Pre-compute normalized scores for each baseline + LLM
AGENT_LABELS = ["random", "sequential", "smart", "llm (Llama-3.3-70B)"]
LLM_SCORES_BY_TASK = dict(zip(TASKS, SCORES))

AGENT_SCORES: dict[str, list[float]] = {}
for agent_name, per_task in BASELINE_RAW.items():
    AGENT_SCORES[agent_name] = [
        normalized_score(*per_task[t][:3]) for t in TASKS
    ]
AGENT_SCORES["llm (Llama-3.3-70B)"] = [LLM_SCORES_BY_TASK[t] for t in TASKS]

AGENT_AVG = {a: sum(s) / len(s) for a, s in AGENT_SCORES.items()}

AGENT_COLORS = {
    "random": "#9E9E9E",
    "sequential": "#F4A261",
    "smart": "#2A9D8F",
    "llm (Llama-3.3-70B)": "#6A4C93",
}

PER_STEP_REWARDS = {
    "basic_validation": [
        0.33, 0.23, 0.28, 0.18, 0.13, 0.28, 0.25, 0.28, 0.28,
        0.18, 0.23, 0.33, 0.13, 0.03, 0.03, 0.13, -0.05, 0.03,
    ],
    "edge_cases": [
        0.33, 0.28, 0.28, 0.08, 0.18, 0.25, 0.48, 0.28, 0.33,
        0.08, 0.33, 0.03, 0.23, 0.33, 0.28, 0.18, 0.03, 0.08,
        0.08, 0.13, 0.13, 0.08, 0.13, 0.00, 0.33, 0.08, 0.00,
    ],
    "security_workflows": [
        0.33, 0.28, 0.28, 0.08, 0.03, 0.18, 0.48, 0.23, 0.28,
        0.25, 0.33, 0.33, 0.23, 0.33, 0.28, 0.08, 0.18, 0.03,
        0.13, 0.13, 0.13, 0.08, 0.00, 0.13, 0.00, -0.05, -0.05,
        0.03, -0.05,
    ],
}

COLORS = {
    "basic_validation": "#4C72B0",
    "edge_cases": "#55A868",
    "security_workflows": "#C44E52",
}


# ---------- matplotlib ----------
def plot_matplotlib() -> None:
    fig, axes = plt.subplots(1, 2, figsize=(13, 5.2))

    # 1. Final scores per task
    ax = axes[0]
    bar_colors = [COLORS[t] for t in TASKS]
    bars = ax.bar(TASKS, SCORES, color=bar_colors, edgecolor="black", linewidth=0.6)
    ax.axhline(AVG_SCORE, color="#333", linestyle="--", linewidth=1.2,
               label=f"avg = {AVG_SCORE:.3f}")
    ax.set_ylim(0, 1.0)
    ax.set_ylabel("Final score")
    ax.set_title("Inference final score by task")
    ax.legend(loc="upper right", frameon=False)
    for bar, score, steps in zip(bars, SCORES, STEPS):
        ax.text(
            bar.get_x() + bar.get_width() / 2,
            bar.get_height() + 0.015,
            f"{score:.3f}\n({steps} steps)",
            ha="center", va="bottom", fontsize=9,
        )
    ax.tick_params(axis="x", rotation=15)

    # 2. Per-step rewards (grouped over step index)
    ax = axes[1]
    max_len = max(len(v) for v in PER_STEP_REWARDS.values())
    width = 0.27
    x_base = list(range(1, max_len + 1))
    for i, task in enumerate(TASKS):
        rewards = PER_STEP_REWARDS[task]
        xs = [x + (i - 1) * width for x in range(1, len(rewards) + 1)]
        ax.bar(xs, rewards, width=width, color=COLORS[task],
               label=task, edgecolor="black", linewidth=0.3)
    ax.axhline(0, color="#666", linewidth=0.8)
    ax.set_xlabel("Step")
    ax.set_ylabel("Reward")
    ax.set_title("Per-step reward by task")
    ax.set_xticks(x_base[::2])
    ax.legend(frameon=False, fontsize=9)

    fig.suptitle(
        "inference.py — meta-llama/Llama-3.3-70B-Instruct (avg score 0.667)",
        fontsize=12, fontweight="bold",
    )
    fig.tight_layout(rect=(0, 0, 1, 0.96))

    png_path = OUT_DIR / "inference_results_matplotlib.png"
    svg_path = OUT_DIR / "inference_results_matplotlib.svg"
    fig.savefig(png_path, dpi=160, bbox_inches="tight")
    fig.savefig(svg_path, bbox_inches="tight")
    plt.close(fig)
    print(f"[matplotlib] wrote {png_path}")
    print(f"[matplotlib] wrote {svg_path}")


# ---------- plotly ----------
def plot_plotly() -> None:
    fig = make_subplots(
        rows=1, cols=2,
        column_widths=[0.4, 0.6],
        subplot_titles=("Final score by task", "Per-step reward by task"),
    )

    # 1. Final scores
    fig.add_trace(
        go.Bar(
            x=TASKS,
            y=SCORES,
            marker_color=[COLORS[t] for t in TASKS],
            text=[f"{s:.3f}<br>({n} steps)" for s, n in zip(SCORES, STEPS)],
            textposition="outside",
            name="Final score",
            showlegend=False,
        ),
        row=1, col=1,
    )
    fig.add_hline(
        y=AVG_SCORE, line_dash="dash", line_color="#333",
        annotation_text=f"avg = {AVG_SCORE:.3f}",
        annotation_position="top left",
        row=1, col=1,
    )

    # 2. Per-step rewards (grouped bars)
    for task in TASKS:
        rewards = PER_STEP_REWARDS[task]
        fig.add_trace(
            go.Bar(
                x=list(range(1, len(rewards) + 1)),
                y=rewards,
                name=task,
                marker_color=COLORS[task],
            ),
            row=1, col=2,
        )

    fig.update_yaxes(title_text="Final score", range=[0, 1.0], row=1, col=1)
    fig.update_yaxes(title_text="Reward", row=1, col=2)
    fig.update_xaxes(title_text="Step", row=1, col=2)
    fig.update_layout(
        title=dict(
            text="inference.py — meta-llama/Llama-3.3-70B-Instruct (avg score 0.667)",
            x=0.5, xanchor="center",
        ),
        barmode="group",
        bargap=0.2,
        template="plotly_white",
        width=1300,
        height=560,
        legend=dict(orientation="h", y=-0.18, x=0.5, xanchor="center"),
        margin=dict(t=80, b=80, l=60, r=30),
    )

    png_path = OUT_DIR / "inference_results_plotly.png"
    svg_path = OUT_DIR / "inference_results_plotly.svg"
    fig.write_image(png_path, scale=2)
    fig.write_image(svg_path)
    print(f"[plotly] wrote {png_path}")
    print(f"[plotly] wrote {svg_path}")


# ---------- baseline comparison: matplotlib ----------
def plot_baselines_matplotlib() -> None:
    fig, axes = plt.subplots(1, 2, figsize=(13.5, 5.4))

    # 1. Grouped bars per task
    ax = axes[0]
    n_agents = len(AGENT_LABELS)
    width = 0.2
    x = list(range(len(TASKS)))
    for i, agent in enumerate(AGENT_LABELS):
        offset = (i - (n_agents - 1) / 2) * width
        xs = [xi + offset for xi in x]
        bars = ax.bar(
            xs, AGENT_SCORES[agent], width=width,
            color=AGENT_COLORS[agent], label=agent,
            edgecolor="black", linewidth=0.4,
        )
        for bar, val in zip(bars, AGENT_SCORES[agent]):
            ax.text(
                bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.012,
                f"{val:.2f}", ha="center", va="bottom", fontsize=7.5,
            )
    ax.set_xticks(x)
    ax.set_xticklabels(TASKS, rotation=10)
    ax.set_ylim(0, 1.0)
    ax.set_ylabel("Normalized score")
    ax.set_title("Per-task score: baselines vs LLM")
    ax.legend(frameon=False, fontsize=8.5, loc="upper right")

    # 2. Average score across all 3 tasks
    ax = axes[1]
    avgs = [AGENT_AVG[a] for a in AGENT_LABELS]
    colors = [AGENT_COLORS[a] for a in AGENT_LABELS]
    bars = ax.bar(AGENT_LABELS, avgs, color=colors, edgecolor="black", linewidth=0.6)
    for bar, val in zip(bars, avgs):
        ax.text(
            bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.012,
            f"{val:.3f}", ha="center", va="bottom", fontsize=10, fontweight="bold",
        )
    ax.set_ylim(0, 1.0)
    ax.set_ylabel("Mean score (3 tasks)")
    ax.set_title("Average score across all tasks")
    ax.tick_params(axis="x", rotation=12)

    fig.suptitle(
        "Baseline agents vs LLM — score = 0.7·bug_ratio + 0.3·coverage_ratio",
        fontsize=12, fontweight="bold",
    )
    fig.tight_layout(rect=(0, 0, 1, 0.95))

    png_path = OUT_DIR / "baseline_comparison_matplotlib.png"
    svg_path = OUT_DIR / "baseline_comparison_matplotlib.svg"
    fig.savefig(png_path, dpi=160, bbox_inches="tight")
    fig.savefig(svg_path, bbox_inches="tight")
    plt.close(fig)
    print(f"[matplotlib] wrote {png_path}")
    print(f"[matplotlib] wrote {svg_path}")


# ---------- baseline comparison: plotly ----------
def plot_baselines_plotly() -> None:
    fig = make_subplots(
        rows=1, cols=2,
        column_widths=[0.62, 0.38],
        subplot_titles=("Per-task score: baselines vs LLM", "Average score across all tasks"),
    )

    # 1. Grouped bars per task
    for agent in AGENT_LABELS:
        fig.add_trace(
            go.Bar(
                x=TASKS,
                y=AGENT_SCORES[agent],
                name=agent,
                marker_color=AGENT_COLORS[agent],
                text=[f"{v:.2f}" for v in AGENT_SCORES[agent]],
                textposition="outside",
                legendgroup=agent,
            ),
            row=1, col=1,
        )

    # 2. Average score
    avgs = [AGENT_AVG[a] for a in AGENT_LABELS]
    fig.add_trace(
        go.Bar(
            x=AGENT_LABELS,
            y=avgs,
            marker_color=[AGENT_COLORS[a] for a in AGENT_LABELS],
            text=[f"{v:.3f}" for v in avgs],
            textposition="outside",
            showlegend=False,
        ),
        row=1, col=2,
    )

    fig.update_yaxes(title_text="Normalized score", range=[0, 1.05], row=1, col=1)
    fig.update_yaxes(title_text="Mean score (3 tasks)", range=[0, 1.05], row=1, col=2)
    fig.update_layout(
        title=dict(
            text="Baseline agents vs LLM — score = 0.7·bug_ratio + 0.3·coverage_ratio",
            x=0.5, xanchor="center",
        ),
        barmode="group",
        bargap=0.18,
        template="plotly_white",
        width=1400,
        height=580,
        legend=dict(orientation="h", y=-0.18, x=0.5, xanchor="center"),
        margin=dict(t=80, b=90, l=60, r=30),
    )

    png_path = OUT_DIR / "baseline_comparison_plotly.png"
    svg_path = OUT_DIR / "baseline_comparison_plotly.svg"
    fig.write_image(png_path, scale=2)
    fig.write_image(svg_path)
    print(f"[plotly] wrote {png_path}")
    print(f"[plotly] wrote {svg_path}")


if __name__ == "__main__":
    plot_matplotlib()
    plot_plotly()
    plot_baselines_matplotlib()
    plot_baselines_plotly()
```
plots/reward_signal_function.png ADDED (Git LFS)