Commit 4057030 (verified), committed by anshumanatrey · 1 parent: 97aee49

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +22 -10
  2. server/tools.py +378 -69
README.md CHANGED
@@ -99,25 +99,37 @@ with SecurityAuditEnv(base_url="http://localhost:8000").sync() as env:
99
  ## Tasks (3 Scenarios)
100
 
101
  ### Easy: Startup Web App Audit
102
- 2 hosts, 3 vulnerabilities (SQLi, default credentials, exposed database). All discoverable with basic scans. Max 30 steps.
103
 
104
  ### Medium: E-commerce Platform Audit
105
- 4 hosts (2 initially hidden behind firewall), 6 vulnerabilities (SSRF, IDOR, hardcoded secrets, unauthenticated Jenkins, weak credentials, outdated TLS). SSRF discovery reveals internal hosts. Attack chaining required. Max 50 steps.
106
 
107
  ### Hard: Enterprise SOC2 Pre-Audit
108
- 6 hosts (3 initially hidden on internal network), 10 vulnerabilities (stored XSS, BOLA, race condition, SSTI, file upload, weak creds, missing encryption, email misconfiguration, vulnerable component, missing rate limiting). Includes honeypot decoy. Progressive network discovery: compromise external hosts to pivot to internal network. Max 60 steps.
109
 
110
  ## Baseline Scores
111
 
112
- Scores from a deterministic audit agent (no LLM) that scans, crawls endpoints, tests each individually, parses output for detections, submits findings, and pivots through discovered vulns to unlock hidden hosts:
113
 
114
- | Scenario | Detection | Coverage | CVSS Accuracy | FP | Final Score |
115
- |----------|-----------|----------|---------------|----|-------------|
116
- | Easy | 1.00 | 1.00 | 1.00 | 0 | **1.00** |
117
- | Medium | 0.67 | 1.00 | 1.00 | 1 | **0.85** |
118
- | Hard | 0.30 | 1.00 | 1.00 | 1 | **0.59** |
119
 
120
- The deterministic baseline achieves full coverage (discovers all hosts via pivoting) but only finds a fraction of vulnerabilities on medium/hard because chained vulns require multi-step reasoning and the step budget is tight. An LLM agent that reasons about attack chains should outperform this baseline.
121
 
122
  ## Scoring
123
 
 
99
  ## Tasks (3 Scenarios)
100
 
101
  ### Easy: Startup Web App Audit
102
+ 2 hosts, 3 vulnerabilities (SQLi, default credentials, exposed database). **Labeled tool output**: tools report vulnerability type, CWE, CVSS, and remediation. Max 30 steps.
103
 
104
  ### Medium: E-commerce Platform Audit
105
+ 4 hosts (2 initially hidden behind firewall), 6 vulnerabilities (SSRF, IDOR, hardcoded secrets, unauthenticated Jenkins, weak credentials, outdated TLS). **Evidence-based output** — tools show anomalous behavior and raw evidence but do NOT label the vulnerability type, CWE, or severity. Agent must classify from evidence. SSRF discovery reveals internal hosts. Attack chaining required. Max 50 steps.
106
 
107
  ### Hard: Enterprise SOC2 Pre-Audit
108
+ 6 hosts (3 initially hidden on internal network), 10 vulnerabilities (stored XSS, BOLA, race condition, SSTI, file upload, weak creds, missing encryption, email misconfiguration, vulnerable component, missing rate limiting). **Raw tool output** — tools return HTTP responses, timing data, error messages, and protocol traces. No labels, no hints. Agent must infer vulnerability type, severity, CWE, and impact from raw evidence. Includes honeypot decoy. Progressive network discovery. Max 60 steps.
109
+
110
+ ## Tool Output Difficulty Tiers
111
+
112
+ The same tools produce different output detail depending on scenario difficulty:
113
+
114
+ | Difficulty | Tool Output Style | Agent Must... |
115
+ |------------|-------------------|---------------|
116
+ | Easy | `[CRITICAL] SQL Injection DETECTED, CWE: CWE-89, CVSS: 9.8` | Read and submit the labeled finding |
117
+ | Medium | `[!] Anomalous response — server fetched internal URL via image_url parameter` | Classify the vulnerability type and assess severity |
118
+ | Hard | `Parameter: image_url=http://10.0.2.30:8080 → HTTP 200, body: Jenkins HTML` | Infer SSRF from raw HTTP behavior, determine CWE-918, estimate CVSS |
119
+
120
+ The three tiers serve different purposes: easy validates the environment mechanics, medium tests classification ability, and hard challenges the reasoning of frontier models.
121
 
122
  ## Baseline Scores
123
 
124
+ Scores from a deterministic rule-based agent (no LLM) that scans, crawls endpoints, tests each individually, and attempts to parse tool output for labeled detections:
125
 
126
+ | Scenario | Detection | Coverage | Final Score | Why |
127
+ |----------|-----------|----------|-------------|-----|
128
+ | Easy | 1.00 | 1.00 | **1.00** | Labeled output parser matches perfectly |
129
+ | Medium | 0.00 | 0.50 | **0.07** | Evidence-based output — parser can't classify, only gets coverage |
130
+ | Hard | 0.00 | 0.40 | **0.00** | Raw output + honeypot penalty exceeds coverage score |
131
 
132
+ The deterministic baseline fails on medium/hard because unlabeled tool output requires reasoning to classify vulnerabilities. An LLM agent that can infer "server fetched internal URL → SSRF" or "payload {{7*7}} returned 49 → SSTI" would significantly outperform this baseline.
133
 
134
  ## Scoring
135
 
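To illustrate why a label-matching baseline scores zero on detection once labels disappear, here is a minimal sketch. The regex and the name `parse_labeled_finding` are assumptions for illustration, not the baseline agent's actual parser. The pattern targets the condensed one-line easy sample from the tier table; the medium and hard samples are the first lines of the VULN-M001 templates in server/tools.py.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for the condensed easy-tier line shown in the README table.
LABELED = re.compile(
    r"\[(?P<tag>[A-Z]+)\]\s+(?P<type>.+?)\s+DETECTED,\s+"
    r"CWE:\s+(?P<cwe>CWE-\d+),\s+CVSS:\s+(?P<cvss>\d+(?:\.\d+)?)"
)


def parse_labeled_finding(line: str) -> Optional[Dict[str, str]]:
    """Return the labeled fields if the line carries them, else None."""
    match = LABELED.search(line)
    return match.groupdict() if match else None


easy = "[CRITICAL] SQL Injection DETECTED, CWE: CWE-89, CVSS: 9.8"
medium = " [!] Anomalous response from server"
hard = " Parameter: image_url | Payload: http://10.0.2.30:8080"

for sample in (easy, medium, hard):
    print(parse_labeled_finding(sample))
```

Only the easy sample yields a finding. The other two carry no type, CWE, or CVSS to extract, so a parser of this kind has nothing to submit and can earn coverage credit at most.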
server/tools.py CHANGED
@@ -4,6 +4,11 @@ Simulated Security Audit Tools.
4
  Each tool reads from the scenario's ground truth data and returns
5
  realistic-looking text output. No real network scanning occurs.
6
  All outputs are deterministic — same inputs always produce same results.
7
  """
8
 
9
  from typing import Any, Dict, List, Optional, Set, Tuple
@@ -64,11 +69,7 @@ TOOL_DEFINITIONS = [
64
 
65
 
66
  def _filter_by_chain(vulns: List[Dict], discovered_vulns: Optional[Set[str]]) -> List[Dict]:
67
- """Filter vulnerabilities by chain prerequisites.
68
-
69
- A vuln is visible if it has no requires_found field,
70
- or all prerequisites are in discovered_vulns.
71
- """
72
  if discovered_vulns is None:
73
  discovered_vulns = set()
74
  return [
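The removed docstring in this hunk states the visibility rule: a vulnerability is visible if it has no `requires_found` field, or if all of its prerequisites are in `discovered_vulns`. The list comprehension itself is elided in the diff, so the following is only a sketch of that rule. The name `filter_by_chain_sketch` and the assumption that `requires_found` holds a list of vulnerability IDs are illustrative, not taken from the source.

```python
from typing import Dict, List, Optional, Set


def filter_by_chain_sketch(
    vulns: List[Dict], discovered_vulns: Optional[Set[str]] = None
) -> List[Dict]:
    """Keep vulns whose prerequisites (if any) have all been discovered."""
    found = discovered_vulns or set()
    return [
        v for v in vulns
        # A missing requires_found field means no prerequisites: all([]) is True.
        if all(req in found for req in v.get("requires_found", []))
    ]


# Hypothetical scenario data: B is only visible once A has been found.
sample = [{"id": "A"}, {"id": "B", "requires_found": ["A"]}]
print([v["id"] for v in filter_by_chain_sketch(sample)])
print([v["id"] for v in filter_by_chain_sketch(sample, {"A"})])
```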
@@ -86,6 +87,353 @@ _PORT_WARNINGS = {
86
  }
87
 
88
 
89
  def execute_tool(
90
  tool_name: str,
91
  arguments: Dict[str, Any],
@@ -231,6 +579,7 @@ def _vulnerability_scan(args, scenario, discovered_hosts, discovered_ports, disc
231
  vulns = [v for v in scenario.get("vulnerabilities", [])
232
  if v["host"] == host and "vulnerability_scan" in v.get("discoverable_by", [])]
233
  vulns = _filter_by_chain(vulns, discovered_vulns)
234
 
235
  lines = [f"Vulnerability scan for {host}:", ""]
236
  reported = set()
@@ -239,9 +588,18 @@ def _vulnerability_scan(args, scenario, discovered_hosts, discovered_ports, disc
239
  found_any = False
240
  for v in vulns:
241
  if v["id"] not in reported:
242
- lines.append(f" [!] VULNERABLE: {v['type']} (CVSS {v['cvss']}) — {v['cwe']}")
243
- lines.append(f" Confidence: High | Source: NVD/OWASP")
244
- lines.append(f" Remediation: {v['remediation']}")
245
  reported.add(v["id"])
246
  found_any = True
247
  if not found_any:
@@ -267,22 +625,8 @@ def _test_injection(args, scenario, discovered_hosts, discovered_ports, discover
267
  if vulns:
268
  lines = [f"Injection testing on {host}{endpoint}:", ""]
269
  for v in vulns:
270
- payload = {"SQL Injection": "' OR 1=1--", "SSRF": "http://169.254.169.254/latest/meta-data/",
271
- "SSTI": "{{7*7}}", "Business Logic": "concurrent request replay"}.get(
272
- v["type"].split("(")[0].strip(), "fuzz pattern set A-F")
273
- lines.extend([
274
- f" [CRITICAL] {v['type']} DETECTED",
275
- f" CWE: {v['cwe']}",
276
- f" OWASP: {v['owasp']}",
277
- f" Exploitation Steps:",
278
- f" 1. Identified injectable parameter in {endpoint}",
279
- f" 2. Confirmed with payload: {payload}",
280
- f" 3. Verified server response indicates vulnerability",
281
- f" Evidence: {v['evidence']}",
282
- f" Suggested CVSS: {v['cvss']} ({v['severity']})",
283
- f" Remediation: {v['remediation']}",
284
- "",
285
- ])
286
  return "\n".join(lines), [], {}, 0.08
287
  else:
288
  return (f"Injection testing on {host}{endpoint}:\n"
@@ -306,15 +650,8 @@ def _test_xss(args, scenario, discovered_hosts, discovered_ports, discovered_vul
306
  if vulns:
307
  lines = [f"XSS testing on {host}{endpoint}:", ""]
308
  for v in vulns:
309
- lines.extend([
310
- f" [{v['severity'].upper()}] {v['type']} DETECTED",
311
- f" CWE: {v['cwe']}",
312
- f" OWASP: {v['owasp']}",
313
- f" Evidence: {v['evidence']}",
314
- f" Suggested CVSS: {v['cvss']} ({v['severity']})",
315
- f" Remediation: {v['remediation']}",
316
- "",
317
- ])
318
  return "\n".join(lines), [], {}, 0.08
319
  else:
320
  return (f"XSS testing on {host}{endpoint}:\n"
@@ -338,15 +675,8 @@ def _test_auth(args, scenario, discovered_hosts, discovered_ports, discovered_vu
338
  if vulns:
339
  lines = [f"Auth & access control testing on {host}" + (endpoint or "") + ":", ""]
340
  for v in vulns:
341
- lines.extend([
342
- f" [ALERT] {v['type']} DETECTED",
343
- f" CWE: {v['cwe']}",
344
- f" OWASP: {v['owasp']}",
345
- f" Evidence: {v['evidence']}",
346
- f" Suggested CVSS: {v['cvss']} ({v['severity']})",
347
- f" Remediation: {v['remediation']}",
348
- "",
349
- ])
350
  return "\n".join(lines), [], {}, 0.08
351
  else:
352
  target = f"{host}{endpoint}" if endpoint else host
@@ -371,15 +701,8 @@ def _test_config(args, scenario, discovered_hosts, discovered_ports, discovered_
371
  if vulns:
372
  lines = [f"Configuration audit for {host}:", ""]
373
  for v in vulns:
374
- lines.extend([
375
- f" [MISCONFIGURATION] {v['type']}",
376
- f" CWE: {v['cwe']}",
377
- f" OWASP: {v['owasp']}",
378
- f" Evidence: {v['evidence']}",
379
- f" Suggested CVSS: {v['cvss']} ({v['severity']})",
380
- f" Remediation: {v['remediation']}",
381
- "",
382
- ])
383
  return "\n".join(lines), [], {}, 0.08
384
  else:
385
  return (f"Configuration audit for {host}:\n"
@@ -403,15 +726,8 @@ def _test_crypto(args, scenario, discovered_hosts, discovered_ports, discovered_
403
  if vulns:
404
  lines = [f"Cryptographic analysis for {host}:", ""]
405
  for v in vulns:
406
- lines.extend([
407
- f" [CRYPTO ISSUE] {v['type']}",
408
- f" CWE: {v['cwe']}",
409
- f" OWASP: {v['owasp']}",
410
- f" Evidence: {v['evidence']}",
411
- f" Suggested CVSS: {v['cvss']} ({v['severity']})",
412
- f" Remediation: {v['remediation']}",
413
- "",
414
- ])
415
  return "\n".join(lines), [], {}, 0.06
416
  else:
417
  return (f"Cryptographic analysis for {host}:\n"
@@ -437,15 +753,8 @@ def _check_secrets(args, scenario, discovered_hosts, discovered_ports, discovere
437
  if vulns:
438
  lines = [f"Secret scanning on {host}" + (endpoint or "") + ":", ""]
439
  for v in vulns:
440
- lines.extend([
441
- f" [SECRET EXPOSED] {v['type']}",
442
- f" CWE: {v['cwe']}",
443
- f" OWASP: {v['owasp']}",
444
- f" Evidence: {v['evidence']}",
445
- f" Suggested CVSS: {v['cvss']} ({v['severity']})",
446
- f" Remediation: {v['remediation']}",
447
- "",
448
- ])
449
  return "\n".join(lines), [], {}, 0.08
450
  else:
451
  target = f"{host}{endpoint}" if endpoint else host
 
4
  Each tool reads from the scenario's ground truth data and returns
5
  realistic-looking text output. No real network scanning occurs.
6
  All outputs are deterministic — same inputs always produce same results.
7
+
8
+ Output detail varies by scenario difficulty:
9
+ - Easy: labeled findings with CWE, CVSS, remediation (training mode)
10
+ - Medium: raw evidence with hints — agent must classify and assess
11
+ - Hard: raw tool output only — agent must infer everything from evidence
12
  """
13
 
14
  from typing import Any, Dict, List, Optional, Set, Tuple
 
69
 
70
 
71
  def _filter_by_chain(vulns: List[Dict], discovered_vulns: Optional[Set[str]]) -> List[Dict]:
72
+ """Filter vulnerabilities by chain prerequisites."""
73
  if discovered_vulns is None:
74
  discovered_vulns = set()
75
  return [
 
87
  }
88
 
89
 
90
+ # ---------------------------------------------------------------------------
91
+ # Raw output templates for medium/hard (keyed by vuln ID)
92
+ # Medium: shows evidence + partial context, no CWE/CVSS/remediation
93
+ # Hard: raw tool output — HTTP responses, timing, errors — agent infers
94
+ # ---------------------------------------------------------------------------
95
+ _RAW_OUTPUTS = {
96
+ # ---- MEDIUM: E-commerce Platform ----
97
+ "VULN-M001": {
98
+ "medium": (
99
+ " [!] Anomalous response from server\n"
100
+ " Parameter: image_url\n"
101
+ " Payload: image_url=http://10.0.2.30:8080\n"
102
+ " Response: HTTP 200 — body contains HTML from internal Jenkins server\n"
103
+ " Payload: image_url=http://169.254.169.254/latest/meta-data/\n"
104
+ " Response: HTTP 200 — body contains AWS metadata (ami-id, instance-type)\n"
105
+ " The server fetched and returned content from internal/cloud URLs via the image_url parameter.\n"
106
+ " Assess the vulnerability type, severity, and impact."
107
+ ),
108
+ "hard": (
109
+ " Parameter: image_url | Payload: http://10.0.2.30:8080\n"
110
+ " Response: HTTP 200\n"
111
+ ' Body (first 200 chars): <!DOCTYPE html><html><head><title>Dashboard [Jenkins]</title><link rel="stylesheet" href="/static/css/jenkins...\n'
112
+ "\n"
113
+ " Parameter: image_url | Payload: http://169.254.169.254/latest/meta-data/\n"
114
+ " Response: HTTP 200\n"
115
+ " Body: ami-id\\ninstance-type\\nlocal-ipv4\\nplacement/availability-zone\n"
116
+ "\n"
117
+ " Parameter: image_url | Payload: http://10.0.2.40:5432\n"
118
+ " Response: HTTP 200 (non-HTTP binary data in body, connection established)\n"
119
+ "\n"
120
+ " Parameter: image_url | Payload: https://example.com/image.png\n"
121
+ " Response: HTTP 200 (valid image returned as expected)"
122
+ ),
123
+ },
124
+ "VULN-M002": {
125
+ "medium": (
126
+ " [!] Authorization bypass detected\n"
127
+ " Authenticated as user_id=100\n"
128
+ ' GET /api/v1/orders/5001 (own) → HTTP 200 {"order_id":5001,"user_id":100,"total":"$149.99"}\n'
129
+ ' GET /api/v1/orders/5002 (other) → HTTP 200 {"order_id":5002,"user_id":101,"email":"other@shop.com","card_last4":"7823"}\n'
130
+ " Other users' PII and payment data accessible by changing the ID parameter.\n"
131
+ " Assess the vulnerability type, severity, and impact."
132
+ ),
133
+ "hard": (
134
+ " Authenticated as user_id=100 (session token: eyJhbGciOiJIUz...)\n"
135
+ ' GET /api/v1/orders/5001 → HTTP 200 {"order_id":5001,"user_id":100,"total":"$149.99","items":[...]}\n'
136
+ ' GET /api/v1/orders/5002 → HTTP 200 {"order_id":5002,"user_id":101,"email":"other@shop.com","card_last4":"7823","total":"$89.50"}\n'
137
+ ' GET /api/v1/orders/5003 → HTTP 200 {"order_id":5003,"user_id":102,"email":"bob@shop.com","card_last4":"1234","total":"$210.00"}\n'
138
+ " GET /api/v1/orders/9999 → HTTP 404\n"
139
+ " Note: Same session token used for all requests. No authorization header variation."
140
+ ),
141
+ },
142
+ "VULN-M003": {
143
+ "medium": (
144
+ " [!] Sensitive data found in client-side code\n"
145
+ " File: /static/js/app.bundle.js (847KB, minified)\n"
146
+ ' Line 1847: const STRIPE_KEY = "sk_live_51HG3jK2eZvKYlo2C0EXAMPLE_REDACTED";\n'
147
+ ' Line 2103: const API_INTERNAL = "http://10.0.2.20:8080/api/v1";\n'
148
+ " Line 2210: // TODO: remove hardcoded keys before production\n"
149
+ " Hardcoded API keys and internal URLs found in publicly accessible JavaScript.\n"
150
+ " Assess the vulnerability type, severity, and impact."
151
+ ),
152
+ "hard": (
153
+ " Analyzing /static/js/app.bundle.js (847KB, minified)\n"
154
+ "\n"
155
+ " High-entropy strings detected:\n"
156
+ ' Line 1847: const STRIPE_KEY = "sk_live_51HG3jK2eZvKYlo2C0EXAMPLE_REDACTED";\n'
157
+ ' Line 2103: const API_INTERNAL = "http://10.0.2.20:8080/api/v1";\n'
158
+ " Line 2210: // TODO: remove hardcoded keys before production\n"
159
+ ' Line 3891: const JWT_SECRET = "shopfast-jwt-secret-2024";\n'
160
+ "\n"
161
+ ' Shannon entropy analysis:\n'
162
+ ' "sk_live_51HG3jK..." — entropy: 5.2 bits/char\n'
163
+ ' "shopfast-jwt-secret..." — entropy: 3.8 bits/char\n'
164
+ " Total: 4 suspicious strings across 1 file"
165
+ ),
166
+ },
167
+ "VULN-M004": {
168
+ "medium": (
169
+ " [!] Service accessible without authentication\n"
170
+ " GET / → HTTP 200 (Jenkins 2.235.1 dashboard rendered)\n"
171
+ " GET /script → HTTP 200 (Groovy script console accessible)\n"
172
+ " GET /manage → HTTP 200 (management interface rendered)\n"
173
+ " No authentication challenge (WWW-Authenticate header absent).\n"
174
+ " CI/CD system with script execution capability is fully open.\n"
175
+ " Assess the vulnerability type, severity, and impact."
176
+ ),
177
+ "hard": (
178
+ " GET / → HTTP 200\n"
179
+ " Headers: X-Jenkins: 2.235.1, X-Hudson: 1.395\n"
180
+ ' Body (excerpt): <div id="main-panel">...Welcome to Jenkins!...2 jobs configured...\n'
181
+ "\n"
182
+ " GET /script → HTTP 200\n"
183
+ ' Body (excerpt): <textarea name="script" id="script">...</textarea>\n'
184
+ " Note: No redirect to /login, no 401/403 status\n"
185
+ "\n"
186
+ " GET /manage → HTTP 200\n"
187
+ ' Body (excerpt): ...Manage Jenkins...Configure System...Manage Plugins...\n'
188
+ "\n"
189
+ " GET /credentials → HTTP 200\n"
190
+ ' Body (excerpt): ...Stored credentials: 3 items...\n'
191
+ "\n"
192
+ " Cookie: no JSESSIONID set (no session management active)"
193
+ ),
194
+ },
195
+ "VULN-M005": {
196
+ "medium": (
197
+ " [!] Database accepts weak credentials\n"
198
+ " PostgreSQL (port 5432):\n"
199
+ " dbuser:dbpass123 → authentication successful\n"
200
+ " Connected — 23 tables visible, no connection rate limiting detected\n"
201
+ " Assess the vulnerability type, severity, and impact."
202
+ ),
203
+ "hard": (
204
+ " PostgreSQL (port 5432):\n"
205
+ " postgres:postgres → FATAL: password authentication failed\n"
206
+ " admin:admin → FATAL: password authentication failed\n"
207
+ " dbuser:dbuser → FATAL: password authentication failed\n"
208
+ " dbuser:dbpass123 → connection established\n"
209
+ "\n"
210
+ ' SELECT current_user; → "dbuser"\n'
211
+ " \\dt → 23 tables (users, orders, payments, sessions, ...)\n"
212
+ " SELECT count(*) FROM users; → 12,847\n"
213
+ "\n"
214
+ " Brute force timing: 50 attempts in 3.1s, no lockout or delay increase"
215
+ ),
216
+ },
217
+ "VULN-M006": {
218
+ "medium": (
219
+ " [!] Outdated TLS configuration detected\n"
220
+ " Server accepts TLSv1.0 and TLSv1.1 (deprecated protocols)\n"
221
+ " Weak cipher suite accepted: TLS_RSA_WITH_3DES_EDE_CBC_SHA\n"
222
+ " HSTS header not present in responses.\n"
223
+ " Assess the vulnerability type, severity, and impact."
224
+ ),
225
+ "hard": (
226
+ " TLS handshake analysis:\n"
227
+ " Protocol: TLSv1.0 → ServerHello (accepted)\n"
228
+ " Protocol: TLSv1.1 → ServerHello (accepted)\n"
229
+ " Protocol: TLSv1.2 → ServerHello (accepted)\n"
230
+ " Protocol: TLSv1.3 → handshake failure\n"
231
+ "\n"
232
+ " Cipher suites (port 443):\n"
233
+ " TLS_RSA_WITH_3DES_EDE_CBC_SHA → accepted\n"
234
+ " TLS_RSA_WITH_AES_128_CBC_SHA → accepted\n"
235
+ " TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 → accepted\n"
236
+ "\n"
237
+ " Certificate: CN=shopfast.com, expires 2025-12-01, issuer=Let's Encrypt\n"
238
+ " HSTS: not present\n"
239
+ " OCSP stapling: not enabled"
240
+ ),
241
+ },
242
+ # ---- HARD: Enterprise SOC2 Pre-Audit ----
243
+ "VULN-H001": {
244
+ "hard": (
245
+ " Parameter: message\n"
246
+ " Payload: <img src=x onerror=alert(1)>\n"
247
+ " POST /support/ticket → HTTP 200\n"
248
+ "\n"
249
+ " Verification: GET /support/tickets?latest=true\n"
250
+ ' Response body excerpt:\n'
251
+ ' <div class="ticket-body"><img src=x onerror=alert(1)></div>\n'
252
+ " Note: Payload persisted in database and rendered without encoding\n"
253
+ " Note: Response headers — no Content-Security-Policy, X-XSS-Protection: 0"
254
+ ),
255
+ },
256
+ "VULN-H002": {
257
+ "hard": (
258
+ " Authenticated as user_id=1001 (Bearer token: eyJhbG...)\n"
259
+ ' GET /api/v2/accounts/1001 → HTTP 200 {"account_id":1001,"name":"Test User","balance":"$12,450.00"}\n'
260
+ ' GET /api/v2/accounts/1002 → HTTP 200 {"account_id":1002,"name":"Jane Smith","balance":"$87,230.50","ssn_last4":"4829"}\n'
261
+ ' GET /api/v2/accounts/1003 → HTTP 200 {"account_id":1003,"name":"Bob Jones","balance":"$3,100.00","ssn_last4":"7712"}\n'
262
+ " GET /api/v2/accounts/9999 → HTTP 404\n"
263
+ "\n"
264
+ " Note: Sequential integer IDs. Same Bearer token used for all requests.\n"
265
+ " Note: PII (SSN, balance) exposed in cross-account responses."
266
+ ),
267
+ },
268
+ "VULN-H003": {
269
+ "hard": (
270
+ " Concurrency testing on /api/v2/transfer:\n"
271
+ " Initial balance (account 1001): $1,000.00\n"
272
+ "\n"
273
+ " Sent 5 concurrent POST requests:\n"
274
+ ' {"from_account":"1001","to_account":"1002","amount":800}\n'
275
+ "\n"
276
+ ' Request 1: HTTP 200 {"status":"success","new_balance":"$200.00"}\n'
277
+ ' Request 2: HTTP 200 {"status":"success","new_balance":"$200.00"}\n'
278
+ ' Request 3: HTTP 200 {"status":"success","new_balance":"$200.00"}\n'
279
+ ' Request 4: HTTP 402 {"error":"insufficient_funds"}\n'
280
+ ' Request 5: HTTP 402 {"error":"insufficient_funds"}\n'
281
+ "\n"
282
+ " Final balance (account 1001): -$1,400.00\n"
283
+ " Total transferred: $2,400.00 from initial $1,000.00\n"
284
+ " Note: Balance check and deduction are not atomic operations"
285
+ ),
286
+ },
287
+ "VULN-H004": {
288
+ "hard": (
289
+ " Parameter: template\n"
290
+ "\n"
291
+ " Payload: {{7*7}}\n"
292
+ ' Response body (excerpt): ...Report generated: 49...\n'
293
+ "\n"
294
+ " Payload: {{config}}\n"
295
+ " Response body (excerpt): ...Report generated: <Config {'ENV': 'production', "
296
+ "'DEBUG': False, 'SECRET_KEY': 'finsecure-prod-key-2024'}>...\n"
297
+ "\n"
298
+ ' Payload: {{"test".__class__.__mro__}}\n'
299
+ " Response body (excerpt): ...(<class 'str'>, <class 'object'>)...\n"
300
+ "\n"
301
+ " Server headers: X-Powered-By: Flask/Jinja2"
302
+ ),
303
+ },
304
+ "VULN-H005": {
305
+ "hard": (
306
+ " File upload testing:\n"
307
+ ' test.txt (text/plain, 12B) → HTTP 200 {"path":"/uploads/test.txt"}\n'
308
+ ' test.php (application/x-php, 45B) → HTTP 200 {"path":"/uploads/test.php"}\n'
309
+ ' test.jsp (text/plain, 38B) → HTTP 200 {"path":"/uploads/test.jsp"}\n'
310
+ ' test.exe (application/octet-stream, 1KB) → HTTP 200 {"path":"/uploads/test.exe"}\n'
311
+ "\n"
312
+ " Verification: GET /uploads/test.php\n"
313
+ " Response: HTTP 200, Content-Type: text/html\n"
314
+ " Body: PHP output rendered (file was executed server-side, not served as static)"
315
+ ),
316
+ },
317
+ "VULN-H006": {
318
+ "hard": (
319
+ " FTP (port 21):\n"
320
+ " USER anonymous → 331 Please specify the password\n"
321
+ " PASS anonymous → 230 Login successful\n"
322
+ " LIST → drwxr-xr-x /reports/\n"
323
+ " drwxr-xr-x /employee_data/\n"
324
+ " drwxr-xr-x /financial/\n"
325
+ " SIZE /financial/Q4_2024_revenue.xlsx → 2,371,584 bytes\n"
326
+ "\n"
327
+ " SMB (port 445):\n"
328
+ " Guest session → \\\\10.0.3.40\\Documents accessible (read/write)\n"
329
+ " Files: employee_roster.csv, salary_data.xlsx, board_minutes_2024.pdf\n"
330
+ " Share permissions: Everyone:FullControl"
331
+ ),
332
+ },
333
+ "VULN-H007": {
334
+ "hard": (
335
+ " FTP (port 21):\n"
336
+ " AUTH TLS → 502 Command not implemented\n"
337
+ " FEAT → no TLS/SSL features advertised\n"
338
+ "\n"
339
+ " Captured transfer: RETR /financial/Q4_2024_revenue.xlsx\n"
340
+ " Transfer protocol: plaintext FTP (no encryption layer)\n"
341
+ " Wireshark excerpt: PASS command visible in cleartext in packet capture\n"
342
+ "\n"
343
+ " SFTP (port 22): connection refused\n"
344
+ " FTPS: not available"
345
+ ),
346
+ },
347
+ "VULN-H008": {
348
+ "hard": (
349
+ " SMTP (port 25, Postfix 3.4.13):\n"
350
+ " EHLO test.local → 250 mail.finsecure.com\n"
351
+ " MAIL FROM:<attacker@external.com> → 250 2.1.0 Ok\n"
352
+ " RCPT TO:<ceo@finsecure.com> → 250 2.1.5 Ok\n"
353
+ " DATA → 354 End data with <CR><LF>.<CR><LF>\n"
354
+ " (message accepted from unauthenticated external sender)\n"
355
+ "\n"
356
+ " DNS records for finsecure.com:\n"
357
+ " SPF: no record found\n"
358
+ " DKIM: no record found\n"
359
+ " DMARC: no record found\n"
360
+ " MX: mail.finsecure.com → 10.0.3.50"
361
+ ),
362
+ },
363
+ "VULN-H009": {
364
+ "hard": (
365
+ " Port 8080: Apache Tomcat/9.0.31\n"
366
+ "\n"
367
+ " NVD database results:\n"
368
+ " CVE-2020-1938 — Base Score: 9.8 — AJP connector file read/inclusion\n"
369
+ " CVE-2020-9484 — Base Score: 7.0 — session persistence deserialization\n"
370
+ " CVE-2020-11996 — Base Score: 7.5 — HTTP/2 request handling DoS\n"
371
+ "\n"
372
+ " Port 8009/tcp: open (AJP13 protocol response received)\n"
373
+ " Note: AJP connector externally accessible from scan origin"
374
+ ),
375
+ },
376
+ "VULN-H010": {
377
+ "hard": (
378
+ " POST /login — rate limit testing:\n"
379
+ " Batch 1: 100 requests with invalid creds in 2.3s\n"
380
+ " Result: 100x HTTP 401 (all processed, none blocked)\n"
381
+ " Avg response time: 23ms (consistent, no throttling)\n"
382
+ "\n"
383
+ " Batch 2: 1000 requests in 18.7s\n"
384
+ " Result: 1000x HTTP 401\n"
385
+ " No CAPTCHA challenge\n"
386
+ " No account lockout triggered\n"
387
+ " No IP-based blocking detected\n"
388
+ "\n"
389
+ " Response headers: no X-RateLimit-* headers present"
390
+ ),
391
+ },
392
+ }
393
+
394
+
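The VULN-H003 template above conveys a race condition purely through numbers, with no label attached. As a worked check of that arithmetic, the sketch below replays the five response codes from the template. `replay_transfers` is a hypothetical helper for illustration and is not part of tools.py.

```python
from typing import List, Tuple


def replay_transfers(initial: int, amount: int, statuses: List[int]) -> Tuple[int, int]:
    """Return (total_transferred, final_balance) for a batch of transfer responses."""
    successes = sum(1 for status in statuses if status == 200)
    total = successes * amount
    return total, initial - total


# Values taken from the VULN-H003 template: $1,000 balance, five $800 transfers.
total, final = replay_transfers(1000, 800, [200, 200, 200, 402, 402])

# With an atomic check-and-deduct, at most this many transfers could succeed.
max_legit = 1000 // 800

print(total, final, max_legit)
```

Three successes where only one is possible, and a final balance below zero, are the evidence an agent would need to infer a race condition from this output.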
+ def _labeled_output(v: Dict) -> str:
+     """Easy mode: full labeled finding with CWE, CVSS, remediation."""
+     tag = {
+         "Critical": "CRITICAL", "High": "ALERT",
+         "Medium": "WARNING", "Low": "INFO",
+     }.get(v["severity"], "FINDING")
+     return (
+         f" [{tag}] {v['type']} DETECTED\n"
+         f" CWE: {v['cwe']}\n"
+         f" OWASP: {v['owasp']}\n"
+         f" Evidence: {v['evidence']}\n"
+         f" Suggested CVSS: {v['cvss']} ({v['severity']})\n"
+         f" Remediation: {v['remediation']}"
+     )
+
+
+ def _format_detection(v: Dict, scenario: Dict, header: str) -> str:
+     """Pick output format based on scenario difficulty."""
+     difficulty = scenario.get("id", "easy")
+
+     if difficulty == "easy":
+         return _labeled_output(v)
+
+     raw = _RAW_OUTPUTS.get(v["id"], {}).get(difficulty)
+     if raw:
+         return raw
+
+     # Fallback for medium vulns without explicit raw output
+     if difficulty == "medium":
+         return (
+             f" [!] Potential issue detected\n"
+             f" Observation: {v['evidence']}\n"
+             f" Assess the vulnerability type, severity, and impact."
+         )
+
+     # Fallback for hard vulns without explicit raw output
+     return (
+         f" Observation: {v['evidence']}\n"
+         f" No further classification available — manual analysis required."
+     )
435
+
436
+
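The tier selection in `_format_detection` can be exercised in isolation with a condensed stand-in. This is a sketch under stated assumptions: `RAW_OUTPUTS_DEMO`, the `VULN-X1` entry, and the sample vulnerability are hypothetical, and the easy branch is simplified to a single line rather than the full `_labeled_output` block.

```python
from typing import Dict

# Hypothetical stand-in for _RAW_OUTPUTS: one entry with a hard-tier template only.
RAW_OUTPUTS_DEMO: Dict[str, Dict[str, str]] = {
    "VULN-X1": {"hard": " GET /admin -> HTTP 200 (no redirect to /login)"},
}


def format_detection_sketch(vuln: Dict[str, str], difficulty: str) -> str:
    """Labeled output for easy; template or evidence-only fallback otherwise."""
    if difficulty == "easy":
        return f" [{vuln['severity'].upper()}] {vuln['type']} DETECTED"
    raw = RAW_OUTPUTS_DEMO.get(vuln["id"], {}).get(difficulty)
    if raw:
        return raw
    if difficulty == "medium":
        return f" [!] Potential issue detected\n Observation: {vuln['evidence']}"
    return f" Observation: {vuln['evidence']}"


demo = {
    "id": "VULN-X1",
    "type": "Missing Authentication",
    "severity": "High",
    "evidence": "Admin panel reachable without login",
}

for tier in ("easy", "medium", "hard"):
    print(tier, "->", format_detection_sketch(demo, tier))
```

Because the demo entry has no medium template, the medium call takes the fallback branch and shows the evidence without naming the vulnerability type, which mirrors the fallback path in the function above.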
437
  def execute_tool(
438
  tool_name: str,
439
  arguments: Dict[str, Any],
 
579
  vulns = [v for v in scenario.get("vulnerabilities", [])
580
  if v["host"] == host and "vulnerability_scan" in v.get("discoverable_by", [])]
581
  vulns = _filter_by_chain(vulns, discovered_vulns)
582
+ difficulty = scenario.get("id", "easy")
583
 
584
  lines = [f"Vulnerability scan for {host}:", ""]
585
  reported = set()
 
588
  found_any = False
589
  for v in vulns:
590
  if v["id"] not in reported:
591
+ if difficulty == "easy":
592
+ lines.append(f" [!] VULNERABLE: {v['type']} (CVSS {v['cvss']}) — {v['cwe']}")
593
+ lines.append(f" Confidence: High | Source: NVD/OWASP")
594
+ lines.append(f" Remediation: {v['remediation']}")
595
+ else:
596
+ raw = _RAW_OUTPUTS.get(v["id"], {}).get(difficulty)
597
+ if raw:
598
+ lines.append(raw)
599
+ else:
600
+ # Generic CVE-style output without explicit classification
601
+ lines.append(f" NVD match for {p['version']}")
602
+ lines.append(f" See output details for assessment.")
603
  reported.add(v["id"])
604
  found_any = True
605
  if not found_any:
 
625
  if vulns:
626
  lines = [f"Injection testing on {host}{endpoint}:", ""]
627
  for v in vulns:
628
+ lines.append(_format_detection(v, scenario, f"Injection testing on {host}{endpoint}"))
629
+ lines.append("")
630
  return "\n".join(lines), [], {}, 0.08
631
  else:
632
  return (f"Injection testing on {host}{endpoint}:\n"
 
650
  if vulns:
651
  lines = [f"XSS testing on {host}{endpoint}:", ""]
652
  for v in vulns:
653
+ lines.append(_format_detection(v, scenario, f"XSS testing on {host}{endpoint}"))
654
+ lines.append("")
655
  return "\n".join(lines), [], {}, 0.08
656
  else:
657
  return (f"XSS testing on {host}{endpoint}:\n"
 
675
  if vulns:
676
  lines = [f"Auth & access control testing on {host}" + (endpoint or "") + ":", ""]
677
  for v in vulns:
678
+ lines.append(_format_detection(v, scenario, f"Auth testing on {host}"))
679
+ lines.append("")
680
  return "\n".join(lines), [], {}, 0.08
681
  else:
682
  target = f"{host}{endpoint}" if endpoint else host
 
701
  if vulns:
702
  lines = [f"Configuration audit for {host}:", ""]
703
  for v in vulns:
704
+ lines.append(_format_detection(v, scenario, f"Configuration audit for {host}"))
705
+ lines.append("")
706
  return "\n".join(lines), [], {}, 0.08
707
  else:
708
  return (f"Configuration audit for {host}:\n"
 
726
  if vulns:
727
  lines = [f"Cryptographic analysis for {host}:", ""]
728
  for v in vulns:
729
+ lines.append(_format_detection(v, scenario, f"Cryptographic analysis for {host}"))
730
+ lines.append("")
731
  return "\n".join(lines), [], {}, 0.06
732
  else:
733
  return (f"Cryptographic analysis for {host}:\n"
 
753
  if vulns:
754
  lines = [f"Secret scanning on {host}" + (endpoint or "") + ":", ""]
755
  for v in vulns:
756
+ lines.append(_format_detection(v, scenario, f"Secret scanning on {host}"))
757
+ lines.append("")
758
  return "\n".join(lines), [], {}, 0.08
759
  else:
760
  target = f"{host}{endpoint}" if endpoint else host