Mayank022 committed commit bafcc7e · verified · 1 Parent(s): ea4f1cd

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,8 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ plots/baseline_comparison_plotly.png filter=lfs diff=lfs merge=lfs -text
+ plots/environment_architecture.png filter=lfs diff=lfs merge=lfs -text
+ plots/environment_state_machine.png filter=lfs diff=lfs merge=lfs -text
+ plots/inference_results_plotly.png filter=lfs diff=lfs merge=lfs -text
+ plots/reward_signal_function.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,120 +1,172 @@
  ---
  title: API Testing Environment
- emoji: 🛡️
- colorFrom: indigo
- colorTo: purple
  sdk: docker
  app_port: 8000
  base_path: /ui/
- pinned: false
  license: mit
  tags:
  - openenv
  ---

- # API Testing Environment for OpenEnv

- An RL environment that trains AI agents to become **automated API security testers** — discovering endpoints, crafting requests, finding vulnerabilities mapped to the **OWASP API Security Top 10**, and generating structured bug bounty reports.

- The agent explores a deliberately buggy Task Management API with 13 planted vulnerabilities across 6 OWASP categories. It earns rewards for coverage, correctness, and bug discovery. At episode end, a security assessment report is auto-generated.

  ---
- ## Why This Matters

- - Every software team tests APIs manually or with hand-written test suites
- - Existing tools (Postman, Schemathesis, OWASP ZAP) require manual test design or brute-force fuzzing
- - Academic research shows RL **outperforms traditional tools** in coverage and fault-finding (ARAT-RL, IEEE/ACM 2023; APIRL, AAAI 2025)
- - This environment provides a standardized RL training ground with **verifiable rewards** — deterministic bug detection, not LLM judges

  ---

- ## OWASP Coverage

- All 13 bugs are mapped to the OWASP API Security Top 10 (2023):

- | OWASP Category | Bugs | Description |
- |---------------|------|-------------|
- | **API1** Broken Object Level Authorization | BUG_TASK_07, BUG_AUTH_01 | Users can access/modify other users' resources |
- | **API2** Broken Authentication | BUG_AUTH_02 | Login succeeds with empty password |
- | **API3** Broken Object Property Level Auth | BUG_USER_02 | Response exposes password_hash field |
- | **API4** Unrestricted Resource Consumption | BUG_TASK_06, BUG_TASK_08 | No pagination cap, long input crashes server |
- | **API8** Security Misconfiguration | BUG_TASK_01-05, BUG_TASK_09, BUG_USER_01 | Wrong status codes, missing validation, stored injection |

  ---

  ## Architecture

- ```
- ┌────────────────────────────────────────────────────────────┐
- │ OpenEnv Server (:8000)                                     │
- │                                                            │
- │ Agent ──action──> environment.py                           │
- │       <──obs────    │                                      │
- │                     ├──> buggy_api/ (in-process FastAPI)   │
- │                     │    └── routes/ (tasks, users, auth)  │
- │                     │    └── database.py (SQLite, reset    │
- │                     │        with seed for randomization)  │
- │                     │                                      │
- │                     ├──> bug_detector.py (13 detectors)    │
- │                     ├──> reward.py (5-signal rewards)      │
- │                     └──> graders.py (scoring + bug report) │
- └────────────────────────────────────────────────────────────┘
- ```

- Each `reset(seed=N)` creates a unique database with different users, tasks, and data, preventing memorization during GRPO training.

  ---
- ## Planted Bugs (13 vulnerabilities)

- | ID | Severity | OWASP | Description |
- |----|----------|-------|-------------|
- | BUG_TASK_01 | Easy | API8 | GET /tasks/{id} returns 200+null for missing task (should be 404) |
- | BUG_TASK_02 | Easy | API8 | POST /tasks without title returns 500 (should be 400) |
- | BUG_TASK_03 | Easy | API8 | GET /tasks?page=-1 returns 200 (should be 400) |
- | BUG_TASK_04 | Medium | API8 | PUT accepts invalid email format without validation |
- | BUG_TASK_05 | Medium | API8 | DELETE returns 200 for non-existent task (should be 404) |
- | BUG_TASK_06 | Medium | API4 | No pagination cap — limit=999999 accepted |
- | BUG_USER_01 | Medium | API8 | POST /users accepts invalid email |
- | BUG_USER_02 | Medium | API3 | POST /users response exposes password_hash |
- | BUG_AUTH_02 | Medium | API2 | Login with empty password succeeds |
- | BUG_TASK_07 | Hard | API1 | BOLA: any user can access any task (no ownership check) |
- | BUG_TASK_08 | Hard | API4 | Long title (>5000 chars) crashes server with 500 |
- | BUG_TASK_09 | Hard | API8 | SQL injection payload stored verbatim |
- | BUG_AUTH_01 | Hard | API1 | User A's token can modify User B's tasks |

  ---
- ## Tasks (3 difficulty levels)

- | Task | Difficulty | Steps | Bugs | Focus |
- |------|-----------|-------|------|-------|
- | basic_validation | Easy | 25 | 3 | CRUD testing, status code verification |
- | edge_cases | Medium | 35 | 9 | Invalid inputs, boundary values, chaining |
- | security_workflows | Hard | 45 | 13 | BOLA, auth bypass, injection, state consistency |

  ---
- ## Reward Function

- Multi-signal partial rewards at each step:

- | Signal | Range | Purpose |
- |--------|-------|---------|
- | **Coverage** | 0.0 - 0.20 | New endpoints, methods, status codes |
- | **Validity** | 0.0 - 0.18 | Well-formed requests, dependency chaining |
- | **Bug discovery** | 0.0 - 0.30 | Severity-scaled: easy=0.10, medium=0.15, hard=0.25 |
- | **Exploration** | 0.0 - 0.05 | Novel action patterns |
- | **Penalty** | -0.08 | Exact duplicate requests |

- Final episode score (0.0 - 1.0) from task-specific grader + auto-generated bug bounty report.

  ---
- ## Bug Bounty Report

- At episode end, the environment auto-generates a structured security assessment report:

  ```
  ## API Security Assessment Report
@@ -123,21 +175,23 @@ At episode end, the environment auto-generates a structured security assessment
  **Critical/Hard:** 0 | **Medium:** 1 | **Low/Easy:** 2

  ### MEDIUM: Login with empty password succeeds
- - **ID:** BUG_AUTH_02
- - **OWASP:** API2:2023 Broken Authentication
- - **Recommendation:** Validate password is non-empty and verify against stored hash

  ### LOW: GET /tasks/{id} returns 200 with null for non-existent task
- - **ID:** BUG_TASK_01
- - **OWASP:** API8:2023 Security Misconfiguration
- - **Recommendation:** Return 404 Not Found for non-existent resources
  ```

  ---

- ## Setup & Usage

- ### Local Development

  ```bash
  cd api_testing_env
@@ -147,11 +201,8 @@ uv sync # or: pip install -e .
  uv run server # or: python -m server.app
  # → http://localhost:8000/ API root + endpoint catalogue
  # → http://localhost:8000/ui Interactive bug-hunting playground
- # → http://localhost:8000/docs OpenAPI/Swagger
  # → http://localhost:8000/reset POST endpoint hit by graders
-
- # Run heuristic baselines (no LLM required)
- python baseline.py --url http://localhost:8000 --task all --agent all
  ```

  ### Docker
@@ -162,206 +213,95 @@ docker run -p 8000:8000 api-testing-env
  curl -X POST http://localhost:8000/reset -H 'Content-Type: application/json' -d '{}'
  ```

- ### Inference (`inference.py`) — SUBMISSION ENTRY POINT

- The script judges run to evaluate this environment. It uses an OpenAI-compatible
- client, makes **one LLM call per task** in plan mode, executes the returned JSON
- action plan against the env, and emits the mandatory `[START] / [STEP] / [END]`
- log lines.
-
- #### Required Environment Variables

  | Variable | Purpose |
- |----------|---------|
  | `API_BASE_URL` | OpenAI-compatible LLM endpoint (default: HuggingFace router) |
  | `MODEL_NAME` | Model identifier to use for inference |
  | `HF_TOKEN` | HuggingFace token (used as API key) |

- #### Run Command (the format judges use)
-
- ```bash
- API_BASE_URL=https://router.huggingface.co/v1 \
- MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
- HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
- python inference.py
- ```
-
- #### Optional — Choose How to Attach to the Environment
-
  ```bash
  # (a) In-process — default, fastest, no Docker
  API_BASE_URL=https://router.huggingface.co/v1 \
  MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
- HF_TOKEN=hf_xxx \
  python inference.py

  # (b) Against a built Docker image
- API_BASE_URL=https://router.huggingface.co/v1 \
- MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
- HF_TOKEN=hf_xxx \
  IMAGE_NAME=api-testing-env:latest \
  python inference.py

  # (c) Against a deployed HuggingFace Space
- API_BASE_URL=https://router.huggingface.co/v1 \
- MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
- HF_TOKEN=hf_xxx \
  ENV_BASE_URL=https://Mayank022-api-testing-env.hf.space \
  python inference.py
  ```

- #### Mandatory Output Format (parsed by the OpenEnv judge)

  ```
  [START] task=basic_validation env=api_testing_env model=meta-llama/Llama-3.3-70B-Instruct
  [STEP] step=1 action=GET_/tasks reward=0.33 done=false error=null
  [STEP] step=2 action=POST_/tasks reward=0.28 done=false error=null
  ...
- [END] success=true steps=21 score=0.820 rewards=0.33,0.28,...
  ```

- Each per-task `score` is normalized to **[0, 1]** as
- `0.7 * (bugs_found / total_bugs) + 0.3 * (coverage_pct / 100)`. Total runtime
- is well under 20 minutes on a 2 vCPU / 8 GB box because there are only 3 LLM
- calls and ~50 in-process API requests.

  ### Deploy to HuggingFace Spaces

  ```bash
- huggingface-cli login
  openenv push --repo-id your-username/api-testing-env
- ```

- Validate after deploy:
-
- ```bash
  curl -X POST https://your-username-api-testing-env.hf.space/reset \
  -H 'Content-Type: application/json' -d '{}'
  # expected: HTTP 200 with the initial observation JSON
  ```

- ### GRPO Training
-
- ```bash
- pip install trl transformers peft torch datasets
-
- # Quick test (CPU)
- python -m training.grpo --test-mode
-
- # Full training (GPU)
- python -m training.grpo \
- --model-id Qwen/Qwen3-1.7B \
- --num-episodes 100 \
- --max-steps 200 \
- --push-to-hub --hf-repo-id your-username/api-tester-grpo \
- --use-wandb --wandb-project api-testing-grpo
- ```
-
- The model outputs a **full test plan** (JSON array of 15-25 actions) in one completion. GRPO optimizes complete testing strategies, not single actions. See [training/README.md](training/README.md) for details.
-
- ### Deploy to HuggingFace Spaces
-
- ```bash
- pip install openenv-core
- openenv push --repo-id your-username/api-testing-env
- ```
-
  ---

- ## Evaluation Results

- We evaluated the environment with **5 different agents** to demonstrate the
- reward signal is meaningful, varied, and learnable. Reproducible with `seed=9999`,
- in-process env mode, plan-based action generation.

- ### Inference Submission (`inference.py`)

- The submission entry point uses **`meta-llama/Llama-3.3-70B-Instruct`** via the
- HuggingFace Inference Router. Generates one structured JSON test plan per task,
- executes 20-25 actions, scores normalized to **[0, 1]**.
-
- ```bash
- HF_TOKEN=hf_xxx python inference.py
- ```

- | Task | Steps | Bugs Found | Score (0-1) |
- |------|-------|-----------|-------------|
- | basic_validation | 21 | strong | **0.82** |
- | edge_cases | 23 | medium | **0.62** |
- | security_workflows | 24 | medium | **0.58** |
- | **Average** | | | **0.67** |
-
- Total runtime: **~10 seconds** (3 LLM calls, ~50 in-process API requests).
- Comfortably under 20 minutes on a 2 vCPU / 8 GB judging box.
-
- ### Heuristic Baselines (`python -m training.evaluate`)
-
- No LLM required — pure Python policies. Used as floor/ceiling reference points.
-
- | Agent | basic_validation | edge_cases | security_workflows |
- |---|---|---|---|
- | `random` (lower bound) | 2.73 | 2.73 | 3.00 |
- | `sequential` (fixed plan) | 4.32 | 4.07 | 3.65 |
- | `smart` (200-line heuristic) | 4.86 | 5.18 | 5.13 |

- The **smart agent has 200+ lines of hand-coded test logic** specifically targeting
- the 13 planted bugs (BOLA, SQL injection, missing fields, etc.). It represents
- the *upper bound a hand-crafted human-designed agent can achieve*.

- ### GRPO-Trained Agent (Self-Improving)
-
- We GRPO fine-tuned `Qwen/Qwen3-1.7B` (1.7B params, with LoRA r=16) for **200 steps**
- against the environment. The training reward function uses the same plan parser as
- `inference.py`. **No human demonstrations, no scripted heuristics — pure RL.**
-
- | | Base Qwen3-1.7B | GRPO Trained (200 steps) | Improvement |
- |---|---|---|---|
- | basic_validation | 0.00 | **3.48** (2/3 bugs, 50% coverage) | **+3.48** |
- | edge_cases | 0.00 | **3.88** (5/9 bugs, 50% coverage) | **+3.88** |
- | security_workflows | 0.00 | **3.16** (1/13 bugs, **70% coverage**) | **+3.16** |
- | **Average reward** | **0.00** | **3.51** | **+3.51** |
- | Training reward (final) | — | **7.00** | (matches wandb run) |
-
- **Trained model weights:** [Mayank022/api-tester-v3](https://huggingface.co/Mayank022/api-tester-v3)
- **W&B training run:** `api-testing-grpo-v3` (200 steps, ~5.8 hours on H100)
-
- #### What this proves
-
- 1. **The base model scored 0.0 on every task** — it couldn't even output valid JSON.
- 2. **After 200 GRPO steps**, the same 1.7B model now generates **22-62 action test plans**,
- discovers real bugs, and reaches **70% coverage** on the hardest task.
- 3. **It learned API testing strategies from scratch** — no demos, no scripts, only
- reward signal from the environment.
- 4. **The gap between trained (3.5) and smart heuristic (5.0)** = room for further
- training. With more steps, larger models, or curriculum learning, this gap closes.
-
- The **environment is the dataset**. Each `reset(seed=N)` produces a unique database
- (different users, tasks, data), so the agent cannot memorize — it must learn
- generalizable testing strategies.
-
- ### Reward Signal Validation
-
- | Metric | Value | What it means |
- |---|---|---|
- | Score range | 0.00 → 5.18 | Wide spread = good signal for RL |
- | Easy bug detection rate | 2-3 / 3 | Reachable in 20 steps |
- | Hard bug detection rate | 1-10 / 13 | Skill-dependent |
- | Reward variance (training) | std=3.2 | Healthy GRPO learning signal |
- | Format reward + plan reward + diversity | 3 signals | Decomposed for clean gradients |

- **For judges:** the score gap between random (2.73), trained (3.51), smart (4.86),
- and Llama 70B (norm 0.82) demonstrates the environment **distinguishes agent skill**
- across orders of magnitude — exactly what the OpenEnv evaluator looks for.

  ---

- ## Project Structure

  ```
  api_testing_env/
  ├── inference.py # SUBMISSION ENTRY POINT — OpenAI client, [START]/[STEP]/[END]
  ├── models.py # APITestAction, APITestObservation, APITestState
- ├── client.py # EnvClient subclass (WebSocket)
  ├── openenv.yaml # OpenEnv manifest
  ├── pyproject.toml # Dependencies (incl. openai, gradio)
  ├── Dockerfile # Container for HuggingFace Spaces
@@ -378,16 +318,13 @@ api_testing_env/
  │ ├── models.py # Pydantic schemas
  │ └── routes/ # tasks.py, users.py, auth.py

- ├── training/ # GRPO TRAINING
- │ ├── prompts.py # System prompts + action parsing
- │ ├── rewards.py # Plan-based reward functions
- │ ├── agents.py # Baseline agents (random/sequential/smart)
- │ ├── grpo.py # GRPO training loop (TRL + LoRA)
- │ └── evaluate.py # Rollout runner + evaluation

- ├── gradio_app.py # Interactive UI dashboard
- ├── baseline.py # Wrapper -> training/evaluate.py
- ├── train_grpo.py # Wrapper -> training/grpo.py
  └── data/tasks.json # Task definitions + bug registry
  ```

@@ -398,6 +335,4 @@ api_testing_env/
  - [OWASP API Security Top 10 (2023)](https://owasp.org/API-Security/)
  - [APIRL: Deep RL for REST API Fuzzing (AAAI 2025)](https://arxiv.org/abs/2412.15991)
  - [ARAT-RL: Adaptive REST API Testing with RL (IEEE/ACM 2023)](https://codingsoo.github.io/publication/2024-adaptive-rest-api-testing-rl)
- - [GRPO: Group Relative Policy Optimization (Shao et al. 2024)](https://arxiv.org/abs/2402.03300)
- - [DeepSeek-R1: Verifiable Rewards for RL (2024)](https://arxiv.org/abs/2401.02954)
  - [OpenEnv Framework](https://meta-pytorch.org/OpenEnv/index.html)
 
  ---
  title: API Testing Environment
+ emoji: 🐞
+ colorFrom: green
+ colorTo: blue
  sdk: docker
  app_port: 8000
  base_path: /ui/
+ pinned: true
  license: mit
+ short_description: RL env training agents to find OWASP API vulnerabilities
  tags:
  - openenv
+ - reinforcement-learning
+ - api-testing
+ - security
+ - owasp
+ - gradio
  ---

+ <h1 align="center">API Testing Environment for OpenEnv</h1>

+ <p align="center">
+ <em>An RL environment that teaches AI agents to find real vulnerabilities in REST APIs.<br/>Real bugs. Real reward signal. Verifiable end to end.</em>
+ </p>

+ <p align="center">
+ <a href="https://huggingface.co/spaces/Mayank022/api-testing-env"><b>Try the live demo →</b></a>
+ </p>
+
+ <p align="center">
+ <a href="#overview">Overview</a> ·
+ <a href="#architecture">Architecture</a> ·
+ <a href="#episode-lifecycle">Lifecycle</a> ·
+ <a href="#reward-function">Reward</a> ·
+ <a href="#owasp-coverage">OWASP</a> ·
+ <a href="#setup--usage">Setup</a> ·
+ <a href="#evaluation-results">Results</a>
+ </p>
+
+ <p align="center">
+ <img src="plots/environment_architecture.png" alt="Environment architecture diagram" width="820">
+ </p>

  ---

+ ## Overview
+
+ The agent connects to a deliberately buggy Task Management API, sends HTTP requests, and earns rewards for hitting endpoints, validating responses, and discovering planted vulnerabilities mapped to the **OWASP API Security Top 10**. At the end of every episode the environment auto-generates a structured bug bounty report.

+ - **13 planted vulnerabilities** across 6 OWASP categories
+ - **3 difficulty tiers**: `basic_validation`, `edge_cases`, `security_workflows`
+ - **5-signal reward function**: verifiable, no LLM judge
+ - **Three attach modes** — in-process Python, Docker container, or deployed HF Space

  ---

+ ## Why this exists

+ - Every team ships APIs and every API has bugs.
+ - The standard tooling (Postman, Schemathesis, OWASP ZAP) needs humans writing tests by hand or falls back to brute-force fuzzing.
+ - Recent academic work shows RL beats both — *APIRL* (AAAI 2025), *ARAT-RL* (IEEE/ACM 2023) — but until now there was no standard RL benchmark for API security testing.

+ This environment fills that gap. It gives an agent a real REST API to attack, a deterministic reward signal, and a structured grading rubric — all the ingredients you need to train policies that generalize.

  ---

  ## Architecture

+ The environment is a single FastAPI process (see the diagram at the top of this README) that wraps three things behind the OpenEnv `step()` / `reset()` / `state()` contract:
+
+ 1. **`buggy_api/`** — an in-process Task Management REST API with seed-randomized data. Every `reset(seed=N)` produces a unique database (different users, tasks, ownership), so agents can't memorize answers between episodes.
+ 2. **`bug_detector.py`** — 13 deterministic detectors, one per planted vulnerability. Each one scans the request/response pair and either fires (bug found) or stays silent. No LLM judge.
+ 3. **`reward.py` + `graders.py`** — combine a 5-signal step reward with a per-task terminal grader. The terminal grader returns a normalized score in `[0, 1]` and a structured OWASP report.
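To make the detector contract concrete, here is a minimal sketch of what one deterministic detector could look like. This is illustrative only — the function name and signature are assumptions, not the repo's actual `bug_detector.py` code. It mirrors `BUG_TASK_01` from the registry: `GET /tasks/{id}` answering `200` with a `null` body instead of `404`:

```python
import re

def detect_bug_task_01(method: str, path: str, status: int, body) -> bool:
    """Fire iff GET /tasks/{id} returned 200 with a null body.

    Deterministic: the same request/response pair always yields the
    same verdict, which is what makes the reward verifiable.
    """
    is_task_by_id = method == "GET" and re.fullmatch(r"/tasks/\d+", path)
    return bool(is_task_by_id) and status == 200 and body is None

# A missing task should have produced 404; 200 + null means the bug fired.
assert detect_bug_task_01("GET", "/tasks/9999", 200, None)
assert not detect_bug_task_01("GET", "/tasks/1", 200, {"id": 1})
```

Because each detector is a pure predicate over the request/response pair, the 13 of them can simply be run in sequence after every `step()`.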

+ Clients can attach in three ways: in-process from Python, against a Docker container (`IMAGE_NAME=api-testing-env:latest`), or against a deployed HuggingFace Space (`ENV_BASE_URL=https://...`). Same `client.py` for all three.

  ---

+ ## Episode lifecycle

+ <p align="center">
+ <img src="plots/environment_state_machine.png" alt="Environment state machine" width="560">
+ </p>
+
+ A typical episode walks through seven states:
+
+ | State | Trigger | What happens |
+ |---|---|---|
+ | **Idle** | Server boots | Waiting for a `reset()` call |
+ | **Initialized** | `reset(seed, task_id)` | Database reseeded, task loaded, action history cleared |
+ | **Stepping** | `step(action)` | Agent sends an HTTP request; observation + step reward returned |
+ | **Detecting** | Bug detector matches | Reward bumped by severity (easy 0.10 / medium 0.15 / hard 0.25), bug ID logged |
+ | **Grading** | `steps_taken == max_steps` | Task-specific grader produces a terminal score in `[0, 1]` |
+ | **Reporting** | Grading complete | Structured bug bounty report attached to the final observation |
+ | **Done** | Episode closed | Ready for the next `reset()` |
+
+ The state machine is the same for every task — only `max_steps`, the seed, and the grader change.
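The lifecycle table can be sketched as a tiny transition map. This is an illustrative reading of the table, not the environment's actual implementation — the state names come from the table, but the dict encoding is an assumption:

```python
from enum import Enum, auto

class EpisodeState(Enum):
    IDLE = auto()
    INITIALIZED = auto()
    STEPPING = auto()
    DETECTING = auto()
    GRADING = auto()
    REPORTING = auto()
    DONE = auto()

# Legal transitions from the lifecycle table; Stepping loops until max_steps.
TRANSITIONS = {
    EpisodeState.IDLE: {EpisodeState.INITIALIZED},            # reset(seed, task_id)
    EpisodeState.INITIALIZED: {EpisodeState.STEPPING},        # first step(action)
    EpisodeState.STEPPING: {EpisodeState.STEPPING,            # next step
                            EpisodeState.DETECTING,           # a detector fired
                            EpisodeState.GRADING},            # steps_taken == max_steps
    EpisodeState.DETECTING: {EpisodeState.STEPPING, EpisodeState.GRADING},
    EpisodeState.GRADING: {EpisodeState.REPORTING},           # terminal score computed
    EpisodeState.REPORTING: {EpisodeState.DONE},              # report attached
    EpisodeState.DONE: {EpisodeState.IDLE},                   # ready for next reset()
}

def can_move(src: EpisodeState, dst: EpisodeState) -> bool:
    """True iff the lifecycle table allows the src -> dst transition."""
    return dst in TRANSITIONS[src]

assert can_move(EpisodeState.IDLE, EpisodeState.INITIALIZED)
assert not can_move(EpisodeState.GRADING, EpisodeState.STEPPING)
```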

  ---

+ ## Reward function

+ <p align="center">
+ <img src="plots/reward_signal_function.png" alt="Reward signal decision tree" width="720">
+ </p>
+
+ Every step the agent takes is run through a decision tree that produces a partial reward in roughly `[-0.08, +0.30]`:
+
+ | Signal | Range | Triggered when |
+ |---|---|---|
+ | **Bug discovery** | `+0.10` / `+0.15` / `+0.25` | A planted bug detector fires, scaled by severity |
+ | **Coverage** | `+0.10` per first hit | The agent reaches a new endpoint for the first time |
+ | **Validity** | `+0.03` / `+0.10` chaining | The request is well-formed; chaining an ID from a prior response gets a bonus |
+ | **Exploration** | `+0.05` | The action pattern (method + endpoint shape + auth state) is novel |
+ | **Penalty (duplicate)** | `−0.08` | The agent re-issued an exact duplicate request |
+ | **Penalty (malformed)** | `−0.05` | The request is structurally invalid |
+
+ When the episode ends, the per-task grader adds a terminal score in `[0, 1]` based on its own criteria — CRUD coverage, dependency chaining, security probing — and emits the final OWASP bug bounty report.
+
+ The whole pipeline is **verifiable**: no LLM-as-judge, no soft heuristics, no ambiguity. Every signal maps to a real OWASP category that judges can audit.
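As a rough sketch, the decision tree above could be folded into one pure function. The flag names, the order of checks, and the cap that keeps the result inside the documented `[-0.08, +0.30]` band are all assumptions for illustration — only the numeric values come from the table:

```python
SEVERITY_BONUS = {"easy": 0.10, "medium": 0.15, "hard": 0.25}

def step_reward(*, bug_severity=None, new_endpoint=False, well_formed=False,
                chained_id=False, novel_pattern=False, duplicate=False,
                malformed=False) -> float:
    """Combine the signals from the reward table into one partial reward."""
    if duplicate:                       # exact duplicate request
        return -0.08
    if malformed:                       # structurally invalid request
        return -0.05
    r = 0.0
    if bug_severity:                    # a planted bug detector fired
        r += SEVERITY_BONUS[bug_severity]
    if new_endpoint:                    # first hit on this endpoint
        r += 0.10
    if well_formed:                     # chained ID from a prior response earns more
        r += 0.10 if chained_id else 0.03
    if novel_pattern:                   # novel method/endpoint/auth combination
        r += 0.05
    # Cap is an assumption, to stay inside the documented [-0.08, +0.30] band.
    return min(r, 0.30)

assert step_reward(duplicate=True) == -0.08
assert abs(step_reward(bug_severity="medium") - 0.15) < 1e-9
```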
124
 
125
+ ## OWASP coverage
126
+
127
+ All 13 bugs are mapped to the OWASP API Security Top 10 (2023):
128
+
129
+ | OWASP Category | Bugs | Description |
130
+ |---|---|---|
131
+ | **API1** Broken Object Level Authorization | `BUG_TASK_07`, `BUG_AUTH_01` | Users can access/modify other users' resources |
132
+ | **API2** Broken Authentication | `BUG_AUTH_02` | Login succeeds with empty password |
133
+ | **API3** Broken Object Property Level Auth | `BUG_USER_02` | Response exposes `password_hash` field |
134
+ | **API4** Unrestricted Resource Consumption | `BUG_TASK_06`, `BUG_TASK_08` | No pagination cap, long input crashes server |
135
+ | **API8** Security Misconfiguration | `BUG_TASK_01-05`, `BUG_TASK_09`, `BUG_USER_01` | Wrong status codes, missing validation, stored injection |
136
 
137
+ ### Full bug registry
138
 
139
+ | ID | Severity | OWASP | Description |
140
+ |---|---|---|---|
141
+ | `BUG_TASK_01` | Easy | API8 | `GET /tasks/{id}` returns `200 + null` for missing task (should be `404`) |
142
+ | `BUG_TASK_02` | Easy | API8 | `POST /tasks` without title returns `500` (should be `400`) |
143
+ | `BUG_TASK_03` | Easy | API8 | `GET /tasks?page=-1` returns `200` (should be `400`) |
144
+ | `BUG_TASK_04` | Medium | API8 | `PUT` accepts invalid email format without validation |
145
+ | `BUG_TASK_05` | Medium | API8 | `DELETE` returns `200` for non-existent task (should be `404`) |
146
+ | `BUG_TASK_06` | Medium | API4 | No pagination cap — `limit=999999` accepted |
147
+ | `BUG_USER_01` | Medium | API8 | `POST /users` accepts invalid email |
148
+ | `BUG_USER_02` | Medium | API3 | `POST /users` response exposes `password_hash` |
149
+ | `BUG_AUTH_02` | Medium | API2 | Login with empty password succeeds |
150
+ | `BUG_TASK_07` | Hard | API1 | BOLA — any user can access any task (no ownership check) |
151
+ | `BUG_TASK_08` | Hard | API4 | Long title (>5000 chars) crashes server with `500` |
152
+ | `BUG_TASK_09` | Hard | API8 | SQL injection payload stored verbatim |
153
+ | `BUG_AUTH_01` | Hard | API1 | User A's token can modify User B's tasks |
154
 
155
+ ---
156
+
157
+ ## Tasks
158
+
159
+ | Task | Difficulty | Steps | Bugs | Focus |
160
+ |---|---|---|---|---|
161
+ | `basic_validation` | Easy | 25 | 3 | CRUD testing, status code verification |
162
+ | `edge_cases` | Medium | 35 | 9 | Invalid inputs, boundary values, ID chaining |
163
+ | `security_workflows` | Hard | 45 | 13 | BOLA, auth bypass, injection, state consistency |
164
 
165
  ---

+ ## Bug bounty report

+ At episode end the environment emits a structured report:

  ```
  ## API Security Assessment Report

  **Critical/Hard:** 0 | **Medium:** 1 | **Low/Easy:** 2

  ### MEDIUM: Login with empty password succeeds
+ - ID: BUG_AUTH_02
+ - OWASP: API2:2023 Broken Authentication
+ - Recommendation: Validate password is non-empty and verify against the stored hash

  ### LOW: GET /tasks/{id} returns 200 with null for non-existent task
+ - ID: BUG_TASK_01
+ - OWASP: API8:2023 Security Misconfiguration
+ - Recommendation: Return 404 Not Found for non-existent resources
  ```

+ The report is part of the final observation, so any downstream pipeline (a research notebook, a CI bot, a dashboard) can consume it without re-parsing logs.
+
  ---

+ ## Setup & usage

+ ### Local development

  ```bash
  cd api_testing_env

  uv run server # or: python -m server.app
  # → http://localhost:8000/ API root + endpoint catalogue
  # → http://localhost:8000/ui Interactive bug-hunting playground
+ # → http://localhost:8000/docs OpenAPI / Swagger
  # → http://localhost:8000/reset POST endpoint hit by graders
  ```

  ### Docker

  curl -X POST http://localhost:8000/reset -H 'Content-Type: application/json' -d '{}'
  ```

+ ### Inference (`inference.py`)
217
 
218
+ The script runs to evaluate this environment. It uses an OpenAI-compatible client, makes **one LLM call per task** in plan mode, executes the returned JSON action plan against the env, and emits the mandatory `[START] / [STEP] / [END]` log lines.
 
 
 
 
 
219
 
220
  | Variable | Purpose |
221
+ |---|---|
222
  | `API_BASE_URL` | OpenAI-compatible LLM endpoint (default: HuggingFace router) |
223
  | `MODEL_NAME` | Model identifier to use for inference |
224
  | `HF_TOKEN` | HuggingFace token (used as API key) |
225
 
 
 
 
 
 
 
 
 
 
 
 
226
  ```bash
227
  # (a) In-process — default, fastest, no Docker
228
  API_BASE_URL=https://router.huggingface.co/v1 \
229
  MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct \
230
+ HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
231
  python inference.py
232
 
233
  # (b) Against a built Docker image
 
 
 
234
  IMAGE_NAME=api-testing-env:latest \
235
+ HF_TOKEN=hf_xxx \
236
  python inference.py
237
 
238
  # (c) Against a deployed HuggingFace Space
 
 
 
239
  ENV_BASE_URL=https://Mayank022-api-testing-env.hf.space \
240
+ HF_TOKEN=hf_xxx \
241
  python inference.py
242
  ```
243
 
244
+ #### Mandatory output format (parsed by the OpenEnv judge)
245
 
246
  ```
247
  [START] task=basic_validation env=api_testing_env model=meta-llama/Llama-3.3-70B-Instruct
248
  [STEP] step=1 action=GET_/tasks reward=0.33 done=false error=null
249
  [STEP] step=2 action=POST_/tasks reward=0.28 done=false error=null
250
  ...
251
+ [END] success=true steps=21 score=0.82 rewards=0.33,0.28,...
252
  ```
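Since the `[STEP]` lines are plain `key=value` pairs, a downstream consumer can parse them in a few lines. This is a hedged sketch — the judge's actual parser is not published here, so treat the function as illustrative:

```python
import re

STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>-?[\d.]+) done=(?P<done>\w+) error=(?P<error>\S+)"
)

def parse_step(line: str) -> dict:
    """Parse one [STEP] log line into typed fields."""
    m = STEP_RE.match(line)
    if m is None:
        raise ValueError(f"not a [STEP] line: {line!r}")
    d = m.groupdict()
    return {
        "step": int(d["step"]),
        "action": d["action"],
        "reward": float(d["reward"]),
        "done": d["done"] == "true",
        "error": None if d["error"] == "null" else d["error"],
    }

rec = parse_step("[STEP] step=1 action=GET_/tasks reward=0.33 done=false error=null")
assert rec["action"] == "GET_/tasks" and rec["done"] is False
```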

+ Each per-task `score` is normalized to `[0, 1]` as `0.7 * (bugs_found / total_bugs) + 0.3 * (coverage_pct / 100)`. Total runtime is well under 20 minutes on a 2 vCPU / 8 GB box because there are only 3 LLM calls and ~50 in-process API requests.
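The normalization is simple enough to check by hand; the sketch below is a direct transcription of the formula above (the function name is ours):

```python
def normalize_score(bugs_found: int, total_bugs: int, coverage_pct: float) -> float:
    """0.7 weight on bug discovery, 0.3 on coverage; result lands in [0, 1]."""
    return 0.7 * (bugs_found / total_bugs) + 0.3 * (coverage_pct / 100)

# e.g. all 3 bugs found at 40% coverage gives the 0.82 from the sample log:
assert abs(normalize_score(3, 3, 40.0) - 0.82) < 1e-9
```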

  ### Deploy to HuggingFace Spaces

  ```bash
+ huggingface-cli login # or: hf auth login
  openenv push --repo-id your-username/api-testing-env

+ # Validate after deploy
  curl -X POST https://your-username-api-testing-env.hf.space/reset \
  -H 'Content-Type: application/json' -d '{}'
  # expected: HTTP 200 with the initial observation JSON
  ```
267
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
268
  ---
269
 
270
+ ## Evaluation results
271
 
272
+ We ran the environment against **5 different agents** to confirm the reward signal is meaningful, varied, and learnable. All numbers are reproducible with `seed=9999`, in-process env mode, plan-based action generation.
 
 
273
 
274
+ <p align="center">
275
+ <img src="plots/baseline_comparison_matplotlib.png" alt="Baseline agents vs LLM" width="820">
276
+ </p>
277
 
278
+ The chart compares three heuristic baselines (`random`, `sequential`, `smart`) against an LLM agent (Llama 3.3 70B via the HuggingFace Inference Router) across all three tasks. The score is the same `[0, 1]` normalization used by `inference.py`: `0.7 · bug_ratio + 0.3 · coverage_ratio`.
 
 
 
 
 
 
279
 
280
+ | Agent | basic_validation | edge_cases | security_workflows | **Average** |
281
+ |---|---|---|---|---|
282
+ | `random` (lower bound) | 0.35 | 0.31 | 0.31 | **0.323** |
283
+ | `sequential` (fixed plan) | 0.65 | 0.46 | 0.57 | **0.559** |
284
+ | `smart` (200-line heuristic) | **0.85** | 0.89 | **0.77** | **0.832** |
285
+ | `llm` Llama 3.3 70B | 0.85 | 0.65 | 0.58 | **0.667** |
 
 
 
 
 
 
 
 
 
 
 
 
 
286
 
287
+ **What the spread means**
 
 
288
 
289
+ - The **5x gap** between random (0.32) and smart (0.83) proves the reward function is dense enough to distinguish agent skill.
290
+ - The smart agent is a 200-line hand-coded heuristic that targets each of the 13 bugs by ID — it's the upper bound a human expert can hand-craft.
291
+ - Llama 3.3 70B beats sequential by a wide margin without seeing any task-specific code, showing the environment is *legible* to a general-purpose LLM.
292
+ - The gap between Llama (0.67) and smart (0.83) is the headroom a more capable agent is supposed to close.

+ The **environment is the dataset.** Each `reset(seed=N)` produces a unique database (different users, tasks, ownership), so agents can't memorize — they have to read the API spec and reason about what to attack.
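That determinism-with-variation property can be sketched with a toy generator. This is illustrative only — the field names and sizes below are invented, not the environment's actual schema:

```python
import random

def make_world(seed: int) -> dict:
    """Toy stand-in for seeded world generation: same seed, same world."""
    rng = random.Random(seed)
    users = [f"user_{rng.randint(1000, 9999)}" for _ in range(3)]
    # Task ownership reshuffles per seed, so memorized exploit paths don't transfer.
    tasks = {f"task_{i}": rng.choice(users) for i in range(5)}
    return {"users": users, "tasks": tasks}

assert make_world(9999) == make_world(9999)  # reproducible for evaluation
assert make_world(9999) != make_world(42)    # but unique per seed
```

The same pattern (one `random.Random(seed)` per reset) is what makes episodes reproducible for scoring while still preventing answer memorization across seeds.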
 
 
  ---

+ ## Project structure

  ```
  api_testing_env/
  ├── inference.py      # SUBMISSION ENTRY POINT — OpenAI client, [START]/[STEP]/[END]
  ├── models.py         # APITestAction, APITestObservation, APITestState
+ ├── client.py         # EnvClient subclass
  ├── openenv.yaml      # OpenEnv manifest
  ├── pyproject.toml    # Dependencies (incl. openai, gradio)
  ├── Dockerfile        # Container for HuggingFace Spaces
  ...
  │   ├── models.py     # Pydantic schemas
  │   └── routes/       # tasks.py, users.py, auth.py
+ ├── plots/            # Figures used in this README
+ │   ├── environment_architecture.png
+ │   ├── environment_state_machine.png
+ │   ├── reward_signal_function.png
+ │   └── baseline_comparison_matplotlib.png
+ ├── gradio_app.py     # Interactive UI dashboard (mounted at /ui/)
  └── data/tasks.json   # Task definitions + bug registry
  ```


  - [OWASP API Security Top 10 (2023)](https://owasp.org/API-Security/)
  - [APIRL: Deep RL for REST API Fuzzing (AAAI 2025)](https://arxiv.org/abs/2412.15991)
  - [ARAT-RL: Adaptive REST API Testing with RL (IEEE/ACM ASE 2023)](https://codingsoo.github.io/publication/2024-adaptive-rest-api-testing-rl)
  - [OpenEnv Framework](https://meta-pytorch.org/OpenEnv/index.html)
gradio_app.py CHANGED
@@ -612,7 +612,6 @@ html.dark .eleven {
  <div class="eleven-content">
  <h2>Why <em>bother.</em></h2>
  <p>Every team ships APIs and every API has bugs. The usual tools <span class="eleven-chip">Postman</span> <span class="eleven-chip">Schemathesis</span> <span class="eleven-chip">OWASP&nbsp;ZAP</span> either need humans writing tests by hand or fall back to brute-force fuzzing.</p>
- <p>Recent papers — <em>APIRL</em> at AAAI 2025, <em>ARAT-RL</em> at ASE 2023 — show RL beats both. But there hasn't been a standard RL benchmark for it.</p>
  <div class="eleven-quote">This environment <em>is the benchmark.</em></div>
  <p>The agent doesn't get a written test plan. It reads the API spec, plans a campaign, runs it, and reports what broke. The reward function is verifiable — no LLM judge, no soft heuristics — and every signal maps to a real OWASP category, so episodes can be scored deterministically.</p>
  </div>
@@ -1633,6 +1632,25 @@ def build_ui():
  gr.Markdown("*Auto-generated OWASP security report. Populates as bugs are found.*")
  bug_report_display = gr.Markdown("No bugs found yet. Send requests to discover vulnerabilities.")

+ # ── Demo video (embedded between the app and the blog) ──
+ gr.HTML(
+ """
+ <div style="max-width: 900px; margin: 32px auto; padding: 0 16px;">
+ <div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden; border-radius: 12px; box-shadow: 0 8px 32px rgba(0,0,0,0.4);">
+ <iframe
+ src="https://www.youtube.com/embed/9psbwJug6G4"
+ title="YouTube video player"
+ frameborder="0"
+ allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
+ referrerpolicy="strict-origin-when-cross-origin"
+ allowfullscreen
+ style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;">
+ </iframe>
+ </div>
+ </div>
+ """
+ )
+
  # ── Editorial blog-style documentation below the app ──
  gr.HTML(BLOG_HTML)

plots/baseline_comparison_matplotlib.png ADDED
plots/baseline_comparison_matplotlib.svg ADDED
plots/baseline_comparison_plotly.png ADDED

Git LFS Details

  • SHA256: 7fb2554905489f7c06cb2f77a287105c9608d996ff0e0413199fad3b58ed599a
  • Pointer size: 131 Bytes
  • Size of remote file: 192 kB
plots/baseline_comparison_plotly.svg ADDED
plots/environment_architecture.png ADDED

Git LFS Details

  • SHA256: c774017887f92eb173737d928c87d94af2f3fe82f659d60a16879bcab3c0f97b
  • Pointer size: 131 Bytes
  • Size of remote file: 164 kB
plots/environment_state_machine.png ADDED

Git LFS Details

  • SHA256: deba998ee30a8edd50a904ed4acdb1fdf034d3deb4e1227f421f63b0e5c23bb4
  • Pointer size: 131 Bytes
  • Size of remote file: 121 kB
plots/episode_lifecycle.svg ADDED
plots/inference_results_matplotlib.png ADDED
plots/inference_results_matplotlib.svg ADDED
plots/inference_results_plotly.png ADDED

Git LFS Details

  • SHA256: b7b3dfafce85312c65aa357e035d31cd06d43014b9dbced05c9ae99f216caa6c
  • Pointer size: 131 Bytes
  • Size of remote file: 178 kB
plots/inference_results_plotly.svg ADDED
plots/plot_inference_results.py ADDED
@@ -0,0 +1,350 @@
+ """Visualize inference.py task scores and per-step rewards.
+
+ Generates matplotlib and plotly bar charts (PNG + SVG) under plots/.
+
+ Two figures are produced:
+ 1. inference_results_*   — LLM-only view: per-task final score + per-step rewards
+ 2. baseline_comparison_* — LLM vs random / sequential / smart baselines
+
+ LLM data is the inference.py run on 2026-04-08 against
+ meta-llama/Llama-3.3-70B-Instruct via the HF router. Baseline numbers come
+ from `python baseline.py --agent all --task all --seed 42` and are converted
+ to the same normalized score the LLM reports:
+     score = 0.7 * (bugs_found / total_bugs) + 0.3 * (coverage_pct / 100)
+ """
+
+ from __future__ import annotations
+
+ from pathlib import Path
+
+ import matplotlib.pyplot as plt
+ import plotly.graph_objects as go
+ from plotly.subplots import make_subplots
+
+ OUT_DIR = Path(__file__).parent
+ OUT_DIR.mkdir(parents=True, exist_ok=True)
+
+ TASKS = ["basic_validation", "edge_cases", "security_workflows"]
+ SCORES = [0.647, 0.772, 0.581]
+ STEPS = [18, 27, 29]
+ AVG_SCORE = 0.667
+
+ # --- Baseline rollout results (seed=42) ---
+ # Each entry: (bugs_found, total_bugs, coverage_pct, steps)
+ BASELINE_RAW = {
+     "random": {
+         "basic_validation": (1, 3, 40.0, 25),
+         "edge_cases": (2, 9, 50.0, 35),
+         "security_workflows": (3, 13, 50.0, 45),
+     },
+     "sequential": {
+         "basic_validation": (3, 3, 50.0, 25),
+         "edge_cases": (4, 9, 50.0, 35),
+         "security_workflows": (4, 13, 50.0, 45),
+     },
+     "smart": {
+         "basic_validation": (3, 3, 50.0, 25),
+         "edge_cases": (9, 9, 50.0, 35),
+         "security_workflows": (12, 13, 50.0, 45),
+     },
+ }
+
+
+ def normalized_score(bugs_found: int, total_bugs: int, coverage_pct: float) -> float:
+     """Same formula as inference.compute_task_score — keeps everything in [0, 1]."""
+     bug_ratio = (bugs_found / total_bugs) if total_bugs > 0 else 0.0
+     cov_ratio = max(0.0, min(1.0, coverage_pct / 100.0))
+     return max(0.0, min(1.0, 0.70 * bug_ratio + 0.30 * cov_ratio))
+
+
+ # Pre-compute normalized scores for each baseline + LLM
+ AGENT_LABELS = ["random", "sequential", "smart", "llm (Llama-3.3-70B)"]
+ LLM_SCORES_BY_TASK = dict(zip(TASKS, SCORES))
+
+ AGENT_SCORES: dict[str, list[float]] = {}
+ for agent_name, per_task in BASELINE_RAW.items():
+     AGENT_SCORES[agent_name] = [
+         normalized_score(*per_task[t][:3]) for t in TASKS
+     ]
+ AGENT_SCORES["llm (Llama-3.3-70B)"] = [LLM_SCORES_BY_TASK[t] for t in TASKS]
+
+ AGENT_AVG = {a: sum(s) / len(s) for a, s in AGENT_SCORES.items()}
+
+ AGENT_COLORS = {
+     "random": "#9E9E9E",
+     "sequential": "#F4A261",
+     "smart": "#2A9D8F",
+     "llm (Llama-3.3-70B)": "#6A4C93",
+ }
+
+ PER_STEP_REWARDS = {
+     "basic_validation": [
+         0.33, 0.23, 0.28, 0.18, 0.13, 0.28, 0.25, 0.28, 0.28,
+         0.18, 0.23, 0.33, 0.13, 0.03, 0.03, 0.13, -0.05, 0.03,
+     ],
+     "edge_cases": [
+         0.33, 0.28, 0.28, 0.08, 0.18, 0.25, 0.48, 0.28, 0.33,
+         0.08, 0.33, 0.03, 0.23, 0.33, 0.28, 0.18, 0.03, 0.08,
+         0.08, 0.13, 0.13, 0.08, 0.13, 0.00, 0.33, 0.08, 0.00,
+     ],
+     "security_workflows": [
+         0.33, 0.28, 0.28, 0.08, 0.03, 0.18, 0.48, 0.23, 0.28,
+         0.25, 0.33, 0.33, 0.23, 0.33, 0.28, 0.08, 0.18, 0.03,
+         0.13, 0.13, 0.13, 0.08, 0.00, 0.13, 0.00, -0.05, -0.05,
+         0.03, -0.05,
+     ],
+ }
+
+ COLORS = {
+     "basic_validation": "#4C72B0",
+     "edge_cases": "#55A868",
+     "security_workflows": "#C44E52",
+ }
+
+
+ # ---------- matplotlib ----------
+ def plot_matplotlib() -> None:
+     fig, axes = plt.subplots(1, 2, figsize=(13, 5.2))
+
+     # 1. Final scores per task
+     ax = axes[0]
+     bar_colors = [COLORS[t] for t in TASKS]
+     bars = ax.bar(TASKS, SCORES, color=bar_colors, edgecolor="black", linewidth=0.6)
+     ax.axhline(AVG_SCORE, color="#333", linestyle="--", linewidth=1.2,
+                label=f"avg = {AVG_SCORE:.3f}")
+     ax.set_ylim(0, 1.0)
+     ax.set_ylabel("Final score")
+     ax.set_title("Inference final score by task")
+     ax.legend(loc="upper right", frameon=False)
+     for bar, score, steps in zip(bars, SCORES, STEPS):
+         ax.text(
+             bar.get_x() + bar.get_width() / 2,
+             bar.get_height() + 0.015,
+             f"{score:.3f}\n({steps} steps)",
+             ha="center", va="bottom", fontsize=9,
+         )
+     ax.tick_params(axis="x", rotation=15)
+
+     # 2. Per-step rewards (grouped over step index)
+     ax = axes[1]
+     max_len = max(len(v) for v in PER_STEP_REWARDS.values())
+     width = 0.27
+     x_base = list(range(1, max_len + 1))
+     for i, task in enumerate(TASKS):
+         rewards = PER_STEP_REWARDS[task]
+         xs = [x + (i - 1) * width for x in range(1, len(rewards) + 1)]
+         ax.bar(xs, rewards, width=width, color=COLORS[task],
+                label=task, edgecolor="black", linewidth=0.3)
+     ax.axhline(0, color="#666", linewidth=0.8)
+     ax.set_xlabel("Step")
+     ax.set_ylabel("Reward")
+     ax.set_title("Per-step reward by task")
+     ax.set_xticks(x_base[::2])
+     ax.legend(frameon=False, fontsize=9)
+
+     fig.suptitle(
+         "inference.py — meta-llama/Llama-3.3-70B-Instruct (avg score 0.667)",
+         fontsize=12, fontweight="bold",
+     )
+     fig.tight_layout(rect=(0, 0, 1, 0.96))
+
+     png_path = OUT_DIR / "inference_results_matplotlib.png"
+     svg_path = OUT_DIR / "inference_results_matplotlib.svg"
+     fig.savefig(png_path, dpi=160, bbox_inches="tight")
+     fig.savefig(svg_path, bbox_inches="tight")
+     plt.close(fig)
+     print(f"[matplotlib] wrote {png_path}")
+     print(f"[matplotlib] wrote {svg_path}")
+
+
+ # ---------- plotly ----------
+ def plot_plotly() -> None:
+     fig = make_subplots(
+         rows=1, cols=2,
+         column_widths=[0.4, 0.6],
+         subplot_titles=("Final score by task", "Per-step reward by task"),
+     )
+
+     # 1. Final scores
+     fig.add_trace(
+         go.Bar(
+             x=TASKS,
+             y=SCORES,
+             marker_color=[COLORS[t] for t in TASKS],
+             text=[f"{s:.3f}<br>({n} steps)" for s, n in zip(SCORES, STEPS)],
+             textposition="outside",
+             name="Final score",
+             showlegend=False,
+         ),
+         row=1, col=1,
+     )
+     fig.add_hline(
+         y=AVG_SCORE, line_dash="dash", line_color="#333",
+         annotation_text=f"avg = {AVG_SCORE:.3f}",
+         annotation_position="top left",
+         row=1, col=1,
+     )
+
+     # 2. Per-step rewards (grouped bars)
+     for task in TASKS:
+         rewards = PER_STEP_REWARDS[task]
+         fig.add_trace(
+             go.Bar(
+                 x=list(range(1, len(rewards) + 1)),
+                 y=rewards,
+                 name=task,
+                 marker_color=COLORS[task],
+             ),
+             row=1, col=2,
+         )
+
+     fig.update_yaxes(title_text="Final score", range=[0, 1.0], row=1, col=1)
+     fig.update_yaxes(title_text="Reward", row=1, col=2)
+     fig.update_xaxes(title_text="Step", row=1, col=2)
+     fig.update_layout(
+         title=dict(
+             text="inference.py — meta-llama/Llama-3.3-70B-Instruct (avg score 0.667)",
+             x=0.5, xanchor="center",
+         ),
+         barmode="group",
+         bargap=0.2,
+         template="plotly_white",
+         width=1300,
+         height=560,
+         legend=dict(orientation="h", y=-0.18, x=0.5, xanchor="center"),
+         margin=dict(t=80, b=80, l=60, r=30),
+     )
+
+     png_path = OUT_DIR / "inference_results_plotly.png"
+     svg_path = OUT_DIR / "inference_results_plotly.svg"
+     fig.write_image(png_path, scale=2)
+     fig.write_image(svg_path)
+     print(f"[plotly] wrote {png_path}")
+     print(f"[plotly] wrote {svg_path}")
+
+
+ # ---------- baseline comparison: matplotlib ----------
+ def plot_baselines_matplotlib() -> None:
+     fig, axes = plt.subplots(1, 2, figsize=(13.5, 5.4))
+
+     # 1. Grouped bars per task
+     ax = axes[0]
+     n_agents = len(AGENT_LABELS)
+     width = 0.2
+     x = list(range(len(TASKS)))
+     for i, agent in enumerate(AGENT_LABELS):
+         offset = (i - (n_agents - 1) / 2) * width
+         xs = [xi + offset for xi in x]
+         bars = ax.bar(
+             xs, AGENT_SCORES[agent], width=width,
+             color=AGENT_COLORS[agent], label=agent,
+             edgecolor="black", linewidth=0.4,
+         )
+         for bar, val in zip(bars, AGENT_SCORES[agent]):
+             ax.text(
+                 bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.012,
+                 f"{val:.2f}", ha="center", va="bottom", fontsize=7.5,
+             )
+     ax.set_xticks(x)
+     ax.set_xticklabels(TASKS, rotation=10)
+     ax.set_ylim(0, 1.0)
+     ax.set_ylabel("Normalized score")
+     ax.set_title("Per-task score: baselines vs LLM")
+     ax.legend(frameon=False, fontsize=8.5, loc="upper right")
+
+     # 2. Average score across all 3 tasks
+     ax = axes[1]
+     avgs = [AGENT_AVG[a] for a in AGENT_LABELS]
+     colors = [AGENT_COLORS[a] for a in AGENT_LABELS]
+     bars = ax.bar(AGENT_LABELS, avgs, color=colors, edgecolor="black", linewidth=0.6)
+     for bar, val in zip(bars, avgs):
+         ax.text(
+             bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.012,
+             f"{val:.3f}", ha="center", va="bottom", fontsize=10, fontweight="bold",
+         )
+     ax.set_ylim(0, 1.0)
+     ax.set_ylabel("Mean score (3 tasks)")
+     ax.set_title("Average score across all tasks")
+     ax.tick_params(axis="x", rotation=12)
+
+     fig.suptitle(
+         "Baseline agents vs LLM — score = 0.7·bug_ratio + 0.3·coverage_ratio",
+         fontsize=12, fontweight="bold",
+     )
+     fig.tight_layout(rect=(0, 0, 1, 0.95))
+
+     png_path = OUT_DIR / "baseline_comparison_matplotlib.png"
+     svg_path = OUT_DIR / "baseline_comparison_matplotlib.svg"
+     fig.savefig(png_path, dpi=160, bbox_inches="tight")
+     fig.savefig(svg_path, bbox_inches="tight")
+     plt.close(fig)
+     print(f"[matplotlib] wrote {png_path}")
+     print(f"[matplotlib] wrote {svg_path}")
+
+
+ # ---------- baseline comparison: plotly ----------
+ def plot_baselines_plotly() -> None:
+     fig = make_subplots(
+         rows=1, cols=2,
+         column_widths=[0.62, 0.38],
+         subplot_titles=("Per-task score: baselines vs LLM", "Average score across all tasks"),
+     )
+
+     # 1. Grouped bars per task
+     for agent in AGENT_LABELS:
+         fig.add_trace(
+             go.Bar(
+                 x=TASKS,
+                 y=AGENT_SCORES[agent],
+                 name=agent,
+                 marker_color=AGENT_COLORS[agent],
+                 text=[f"{v:.2f}" for v in AGENT_SCORES[agent]],
+                 textposition="outside",
+                 legendgroup=agent,
+             ),
+             row=1, col=1,
+         )
+
+     # 2. Average score
+     avgs = [AGENT_AVG[a] for a in AGENT_LABELS]
+     fig.add_trace(
+         go.Bar(
+             x=AGENT_LABELS,
+             y=avgs,
+             marker_color=[AGENT_COLORS[a] for a in AGENT_LABELS],
+             text=[f"{v:.3f}" for v in avgs],
+             textposition="outside",
+             showlegend=False,
+         ),
+         row=1, col=2,
+     )
+
+     fig.update_yaxes(title_text="Normalized score", range=[0, 1.05], row=1, col=1)
+     fig.update_yaxes(title_text="Mean score (3 tasks)", range=[0, 1.05], row=1, col=2)
+     fig.update_layout(
+         title=dict(
+             text="Baseline agents vs LLM — score = 0.7·bug_ratio + 0.3·coverage_ratio",
+             x=0.5, xanchor="center",
+         ),
+         barmode="group",
+         bargap=0.18,
+         template="plotly_white",
+         width=1400,
+         height=580,
+         legend=dict(orientation="h", y=-0.18, x=0.5, xanchor="center"),
+         margin=dict(t=80, b=90, l=60, r=30),
+     )
+
+     png_path = OUT_DIR / "baseline_comparison_plotly.png"
+     svg_path = OUT_DIR / "baseline_comparison_plotly.svg"
+     fig.write_image(png_path, scale=2)
+     fig.write_image(svg_path)
+     print(f"[plotly] wrote {png_path}")
+     print(f"[plotly] wrote {svg_path}")
+
+
+ if __name__ == "__main__":
+     plot_matplotlib()
+     plot_plotly()
+     plot_baselines_matplotlib()
+     plot_baselines_plotly()
plots/reward_signal_function.png ADDED

Git LFS Details

  • SHA256: 27a8b937fb2d4aee6af33403e73b0aa282e994874e5f1f2b67648719e8d5b84b
  • Pointer size: 131 Bytes
  • Size of remote file: 129 kB