Spaces: anshumanatrey/security-audit-env

anshumanatrey committed on 21 days ago

Commit 4057030 (verified) · 1 Parent(s): 97aee49

Upload folder using huggingface_hub

Files changed (2)

README.md +22 -10
server/tools.py +378 -69

README.md CHANGED

@@ -99,25 +99,37 @@ with SecurityAuditEnv(base_url="http://localhost:8000").sync() as env:
 ## Tasks (3 Scenarios)
 ### Easy: Startup Web App Audit
-2 hosts, 3 vulnerabilities (SQLi, default credentials, exposed database). All discoverable with basic scans. Max 30 steps.
 ### Medium: E-commerce Platform Audit
-4 hosts (2 initially hidden behind firewall), 6 vulnerabilities (SSRF, IDOR, hardcoded secrets, unauthenticated Jenkins, weak credentials, outdated TLS). SSRF discovery reveals internal hosts. Attack chaining required. Max 50 steps.
 ### Hard: Enterprise SOC2 Pre-Audit
-6 hosts (3 initially hidden on internal network), 10 vulnerabilities (stored XSS, BOLA, race condition, SSTI, file upload, weak creds, missing encryption, email misconfiguration, vulnerable component, missing rate limiting). Includes honeypot decoy. Progressive network discovery — compromise external hosts to pivot to internal network. Max 60 steps.
 ## Baseline Scores
-Scores from a deterministic audit agent (no LLM) that scans, crawls endpoints, tests each individually, parses output for detections, submits findings, and pivots through discovered vulns to unlock hidden hosts:
-| Scenario | Detection | Coverage | CVSS Accuracy | FP | Final Score |
-|----------|-----------|----------|---------------|----|-------------|
-| Easy | 1.00 | 1.00 | 1.00 | 0 | **1.00** |
-| Medium | 0.67 | 1.00 | 1.00 | 1 | **0.85** |
-| Hard | 0.30 | 1.00 | 1.00 | 1 | **0.59** |
-The deterministic baseline achieves full coverage (discovers all hosts via pivoting) but only finds a fraction of vulnerabilities on medium/hard because chained vulns require multi-step reasoning and the step budget is tight. An LLM agent that reasons about attack chains should outperform this baseline.
 ## Scoring

 ## Tasks (3 Scenarios)
 ### Easy: Startup Web App Audit
+2 hosts, 3 vulnerabilities (SQLi, default credentials, exposed database). **Labeled tool output** — tools report vulnerability type, CWE, CVSS, and remediation. Max 30 steps.
 ### Medium: E-commerce Platform Audit
+4 hosts (2 initially hidden behind firewall), 6 vulnerabilities (SSRF, IDOR, hardcoded secrets, unauthenticated Jenkins, weak credentials, outdated TLS). **Evidence-based output** — tools show anomalous behavior and raw evidence but do NOT label the vulnerability type, CWE, or severity. Agent must classify from evidence. SSRF discovery reveals internal hosts. Attack chaining required. Max 50 steps.
 ### Hard: Enterprise SOC2 Pre-Audit
+6 hosts (3 initially hidden on internal network), 10 vulnerabilities (stored XSS, BOLA, race condition, SSTI, file upload, weak creds, missing encryption, email misconfiguration, vulnerable component, missing rate limiting). **Raw tool output** — tools return HTTP responses, timing data, error messages, and protocol traces. No labels, no hints. Agent must infer vulnerability type, severity, CWE, and impact from raw evidence. Includes honeypot decoy. Progressive network discovery. Max 60 steps.
+## Tool Output Difficulty Tiers
+The same tools produce different output detail depending on scenario difficulty:
+| Difficulty | Tool Output Style | Agent Must... |
+|------------|-------------------|---------------|
+| Easy | `[CRITICAL] SQL Injection DETECTED, CWE: CWE-89, CVSS: 9.8` | Read and submit the labeled finding |
+| Medium | `[!] Anomalous response — server fetched internal URL via image_url parameter` | Classify the vulnerability type and assess severity |
+| Hard | `Parameter: image_url=http://10.0.2.30:8080 → HTTP 200, body: Jenkins HTML` | Infer SSRF from raw HTTP behavior, determine CWE-918, estimate CVSS |
+This three-tier system ensures that Easy validates the environment mechanics, Medium tests classification ability, and Hard challenges the reasoning of frontier models.
 ## Baseline Scores
+Scores from a deterministic rule-based agent (no LLM) that scans, crawls endpoints, tests each individually, and attempts to parse tool output for labeled detections:
+| Scenario | Detection | Coverage | Final Score | Why |
+|----------|-----------|----------|-------------|-----|
+| Easy | 1.00 | 1.00 | **1.00** | Labeled output — parser matches perfectly |
+| Medium | 0.00 | 0.50 | **0.07** | Evidence-based output — parser can't classify, only gets coverage |
+| Hard | 0.00 | 0.40 | **0.00** | Raw output + honeypot penalty exceeds coverage score |
+The deterministic baseline fails on medium and hard because unlabeled tool output requires reasoning to classify vulnerabilities. An LLM agent that can infer "server fetched internal URL → SSRF" or "payload {{7*7}} returned 49 → SSTI" should significantly outperform this baseline.
 ## Scoring
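
The drop in baseline detection from 1.00 to 0.00 follows from the output formats alone. Below is a minimal sketch, assuming a regex-based parser of the kind the README describes. The baseline agent's actual code is not part of this commit, so `parse_labeled_finding` and its patterns are hypothetical. The easy sample follows the `_labeled_output` format and the raw sample is the hard row of the tier table.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns modelled on the easy-tier labeled format.
_TYPE_RE = re.compile(r"\[(?:CRITICAL|ALERT|WARNING|INFO|FINDING)\]\s+(.+?)\s+DETECTED")
_CWE_RE = re.compile(r"CWE:\s*(CWE-\d+)")
_CVSS_RE = re.compile(r"CVSS:\s*(\d+(?:\.\d+)?)")


def parse_labeled_finding(output: str) -> Optional[Dict[str, object]]:
    """Return type, CWE and CVSS if the output carries explicit labels, else None."""
    type_m = _TYPE_RE.search(output)
    cwe_m = _CWE_RE.search(output)
    cvss_m = _CVSS_RE.search(output)
    if not (type_m and cwe_m and cvss_m):
        return None  # nothing to submit: the parser cannot classify unlabeled output
    return {
        "type": type_m.group(1),
        "cwe": cwe_m.group(1),
        "cvss": float(cvss_m.group(1)),
    }


easy_output = (
    "  [CRITICAL] SQL Injection DETECTED\n"
    "  CWE: CWE-89\n"
    "  Suggested CVSS: 9.8 (Critical)\n"
)
raw_output = "  Parameter: image_url=http://10.0.2.30:8080 → HTTP 200, body: Jenkins HTML"

print(parse_labeled_finding(easy_output))
print(parse_labeled_finding(raw_output))
```

The labeled sample yields a complete finding, while the raw sample yields `None`. A parser like this can still earn coverage by visiting hosts, but it has no finding to submit on the unlabeled tiers.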

server/tools.py CHANGED

@@ -4,6 +4,11 @@ Simulated Security Audit Tools.
 Each tool reads from the scenario's ground truth data and returns
 realistic-looking text output. No real network scanning occurs.
 All outputs are deterministic — same inputs always produce same results.
 """
 from typing import Any, Dict, List, Optional, Set, Tuple
@@ -64,11 +69,7 @@ TOOL_DEFINITIONS = [
 def _filter_by_chain(vulns: List[Dict], discovered_vulns: Optional[Set[str]]) -> List[Dict]:
-    """Filter vulnerabilities by chain prerequisites.
-    A vuln is visible if it has no requires_found field,
-    or all prerequisites are in discovered_vulns.
-    """
     if discovered_vulns is None:
         discovered_vulns = set()
     return [
@@ -86,6 +87,353 @@ _PORT_WARNINGS = {
 }
 def execute_tool(
     tool_name: str,
     arguments: Dict[str, Any],
@@ -231,6 +579,7 @@ def _vulnerability_scan(args, scenario, discovered_hosts, discovered_ports, disc
     vulns = [v for v in scenario.get("vulnerabilities", [])
              if v["host"] == host and "vulnerability_scan" in v.get("discoverable_by", [])]
     vulns = _filter_by_chain(vulns, discovered_vulns)
     lines = [f"Vulnerability scan for {host}:", ""]
     reported = set()
@@ -239,9 +588,18 @@ def _vulnerability_scan(args, scenario, discovered_hosts, discovered_ports, disc
         found_any = False
         for v in vulns:
             if v["id"] not in reported:
-                lines.append(f"    [!] VULNERABLE: {v['type']} (CVSS {v['cvss']}) — {v['cwe']}")
-                lines.append(f"        Confidence: High | Source: NVD/OWASP")
-                lines.append(f"        Remediation: {v['remediation']}")
                 reported.add(v["id"])
                 found_any = True
         if not found_any:
@@ -267,22 +625,8 @@ def _test_injection(args, scenario, discovered_hosts, discovered_ports, discover
     if vulns:
         lines = [f"Injection testing on {host}{endpoint}:", ""]
         for v in vulns:
-            payload = {"SQL Injection": "' OR 1=1--", "SSRF": "http://169.254.169.254/latest/meta-data/",
-                       "SSTI": "{{7*7}}", "Business Logic": "concurrent request replay"}.get(
-                v["type"].split("(")[0].strip(), "fuzz pattern set A-F")
-            lines.extend([
-                f"  [CRITICAL] {v['type']} DETECTED",
-                f"  CWE: {v['cwe']}",
-                f"  OWASP: {v['owasp']}",
-                f"  Exploitation Steps:",
-                f"    1. Identified injectable parameter in {endpoint}",
-                f"    2. Confirmed with payload: {payload}",
-                f"    3. Verified server response indicates vulnerability",
-                f"  Evidence: {v['evidence']}",
-                f"  Suggested CVSS: {v['cvss']} ({v['severity']})",
-                f"  Remediation: {v['remediation']}",
-                "",
-            ])
         return "\n".join(lines), [], {}, 0.08
     else:
         return (f"Injection testing on {host}{endpoint}:\n"
@@ -306,15 +650,8 @@ def _test_xss(args, scenario, discovered_hosts, discovered_ports, discovered_vul
     if vulns:
         lines = [f"XSS testing on {host}{endpoint}:", ""]
         for v in vulns:
-            lines.extend([
-                f"  [{v['severity'].upper()}] {v['type']} DETECTED",
-                f"  CWE: {v['cwe']}",
-                f"  OWASP: {v['owasp']}",
-                f"  Evidence: {v['evidence']}",
-                f"  Suggested CVSS: {v['cvss']} ({v['severity']})",
-                f"  Remediation: {v['remediation']}",
-                "",
-            ])
         return "\n".join(lines), [], {}, 0.08
     else:
         return (f"XSS testing on {host}{endpoint}:\n"
@@ -338,15 +675,8 @@ def _test_auth(args, scenario, discovered_hosts, discovered_ports, discovered_vu
     if vulns:
         lines = [f"Auth & access control testing on {host}" + (endpoint or "") + ":", ""]
         for v in vulns:
-            lines.extend([
-                f"  [ALERT] {v['type']} DETECTED",
-                f"  CWE: {v['cwe']}",
-                f"  OWASP: {v['owasp']}",
-                f"  Evidence: {v['evidence']}",
-                f"  Suggested CVSS: {v['cvss']} ({v['severity']})",
-                f"  Remediation: {v['remediation']}",
-                "",
-            ])
         return "\n".join(lines), [], {}, 0.08
     else:
         target = f"{host}{endpoint}" if endpoint else host
@@ -371,15 +701,8 @@ def _test_config(args, scenario, discovered_hosts, discovered_ports, discovered_
     if vulns:
         lines = [f"Configuration audit for {host}:", ""]
         for v in vulns:
-            lines.extend([
-                f"  [MISCONFIGURATION] {v['type']}",
-                f"  CWE: {v['cwe']}",
-                f"  OWASP: {v['owasp']}",
-                f"  Evidence: {v['evidence']}",
-                f"  Suggested CVSS: {v['cvss']} ({v['severity']})",
-                f"  Remediation: {v['remediation']}",
-                "",
-            ])
         return "\n".join(lines), [], {}, 0.08
     else:
         return (f"Configuration audit for {host}:\n"
@@ -403,15 +726,8 @@ def _test_crypto(args, scenario, discovered_hosts, discovered_ports, discovered_
     if vulns:
         lines = [f"Cryptographic analysis for {host}:", ""]
         for v in vulns:
-            lines.extend([
-                f"  [CRYPTO ISSUE] {v['type']}",
-                f"  CWE: {v['cwe']}",
-                f"  OWASP: {v['owasp']}",
-                f"  Evidence: {v['evidence']}",
-                f"  Suggested CVSS: {v['cvss']} ({v['severity']})",
-                f"  Remediation: {v['remediation']}",
-                "",
-            ])
         return "\n".join(lines), [], {}, 0.06
     else:
         return (f"Cryptographic analysis for {host}:\n"
@@ -437,15 +753,8 @@ def _check_secrets(args, scenario, discovered_hosts, discovered_ports, discovere
     if vulns:
         lines = [f"Secret scanning on {host}" + (endpoint or "") + ":", ""]
         for v in vulns:
-            lines.extend([
-                f"  [SECRET EXPOSED] {v['type']}",
-                f"  CWE: {v['cwe']}",
-                f"  OWASP: {v['owasp']}",
-                f"  Evidence: {v['evidence']}",
-                f"  Suggested CVSS: {v['cvss']} ({v['severity']})",
-                f"  Remediation: {v['remediation']}",
-                "",
-            ])
         return "\n".join(lines), [], {}, 0.08
     else:
         target = f"{host}{endpoint}" if endpoint else host
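
The docstring removed from `_filter_by_chain` states the rule: a vulnerability is visible if it has no `requires_found` field or if all its prerequisites are in `discovered_vulns`. The function body is elided in this diff, so the following is only a sketch of that rule. It assumes `requires_found` holds a list of vulnerability IDs; `filter_by_chain_sketch` and the dependency between the two sample entries are invented for illustration.

```python
from typing import Dict, List, Optional, Set


def filter_by_chain_sketch(
    vulns: List[Dict], discovered_vulns: Optional[Set[str]]
) -> List[Dict]:
    """Keep vulns with no prerequisites or whose prerequisites are all discovered."""
    found = discovered_vulns or set()
    return [
        v for v in vulns
        # Assumption: requires_found is a list of vuln IDs; missing or empty means no prerequisite.
        if set(v.get("requires_found") or []) <= found
    ]


# Illustrative data only: this dependency is not taken from the real scenarios.
sample_vulns = [
    {"id": "VULN-M001"},
    {"id": "VULN-M004", "requires_found": ["VULN-M001"]},
]

print([v["id"] for v in filter_by_chain_sketch(sample_vulns, None)])
print([v["id"] for v in filter_by_chain_sketch(sample_vulns, {"VULN-M001"})])
```

With nothing discovered, only the entry without prerequisites is returned. Once `VULN-M001` is in the discovered set, the dependent entry becomes visible as well.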

 Each tool reads from the scenario's ground truth data and returns
 realistic-looking text output. No real network scanning occurs.
 All outputs are deterministic — same inputs always produce same results.
+Output detail varies by scenario difficulty:
+- Easy: labeled findings with CWE, CVSS, remediation (training mode)
+- Medium: raw evidence with hints — agent must classify and assess
+- Hard: raw tool output only — agent must infer everything from evidence
 """
 from typing import Any, Dict, List, Optional, Set, Tuple
 def _filter_by_chain(vulns: List[Dict], discovered_vulns: Optional[Set[str]]) -> List[Dict]:
+    """Filter vulnerabilities by chain prerequisites."""
     if discovered_vulns is None:
         discovered_vulns = set()
     return [
 }
+# ---------------------------------------------------------------------------
+# Raw output templates for medium/hard (keyed by vuln ID)
+# Medium: shows evidence + partial context, no CWE/CVSS/remediation
+# Hard:   raw tool output — HTTP responses, timing, errors — agent infers
+# ---------------------------------------------------------------------------
+_RAW_OUTPUTS = {
+    # ---- MEDIUM: E-commerce Platform ----
+    "VULN-M001": {
+        "medium": (
+            "  [!] Anomalous response from server\n"
+            "  Parameter: image_url\n"
+            "  Payload: image_url=http://10.0.2.30:8080\n"
+            "  Response: HTTP 200 — body contains HTML from internal Jenkins server\n"
+            "  Payload: image_url=http://169.254.169.254/latest/meta-data/\n"
+            "  Response: HTTP 200 — body contains AWS metadata (ami-id, instance-type)\n"
+            "  The server fetched and returned content from internal/cloud URLs via the image_url parameter.\n"
+            "  Assess the vulnerability type, severity, and impact."
+        ),
+        "hard": (
+            "  Parameter: image_url | Payload: http://10.0.2.30:8080\n"
+            "  Response: HTTP 200\n"
+            '  Body (first 200 chars): <!DOCTYPE html><html><head><title>Dashboard [Jenkins]</title><link rel="stylesheet" href="/static/css/jenkins...\n'
+            "\n"
+            "  Parameter: image_url | Payload: http://169.254.169.254/latest/meta-data/\n"
+            "  Response: HTTP 200\n"
+            "  Body: ami-id\\ninstance-type\\nlocal-ipv4\\nplacement/availability-zone\n"
+            "\n"
+            "  Parameter: image_url | Payload: http://10.0.2.40:5432\n"
+            "  Response: HTTP 200 (non-HTTP binary data in body, connection established)\n"
+            "\n"
+            "  Parameter: image_url | Payload: https://example.com/image.png\n"
+            "  Response: HTTP 200 (valid image returned as expected)"
+        ),
+    },
+    "VULN-M002": {
+        "medium": (
+            "  [!] Authorization bypass detected\n"
+            "  Authenticated as user_id=100\n"
+            '  GET /api/v1/orders/5001 (own) → HTTP 200 {"order_id":5001,"user_id":100,"total":"$149.99"}\n'
+            '  GET /api/v1/orders/5002 (other) → HTTP 200 {"order_id":5002,"user_id":101,"email":"other@shop.com","card_last4":"7823"}\n'
+            "  Other users' PII and payment data accessible by changing the ID parameter.\n"
+            "  Assess the vulnerability type, severity, and impact."
+        ),
+        "hard": (
+            "  Authenticated as user_id=100 (session token: eyJhbGciOiJIUz...)\n"
+            '  GET /api/v1/orders/5001 → HTTP 200 {"order_id":5001,"user_id":100,"total":"$149.99","items":[...]}\n'
+            '  GET /api/v1/orders/5002 → HTTP 200 {"order_id":5002,"user_id":101,"email":"other@shop.com","card_last4":"7823","total":"$89.50"}\n'
+            '  GET /api/v1/orders/5003 → HTTP 200 {"order_id":5003,"user_id":102,"email":"bob@shop.com","card_last4":"1234","total":"$210.00"}\n'
+            "  GET /api/v1/orders/9999 → HTTP 404\n"
+            "  Note: Same session token used for all requests. No authorization header variation."
+        ),
+    },
+    "VULN-M003": {
+        "medium": (
+            "  [!] Sensitive data found in client-side code\n"
+            "  File: /static/js/app.bundle.js (847KB, minified)\n"
+            '  Line 1847: const STRIPE_KEY = "sk_live_51HG3jK2eZvKYlo2C0EXAMPLE_REDACTED";\n'
+            '  Line 2103: const API_INTERNAL = "http://10.0.2.20:8080/api/v1";\n'
+            "  Line 2210: // TODO: remove hardcoded keys before production\n"
+            "  Hardcoded API keys and internal URLs found in publicly accessible JavaScript.\n"
+            "  Assess the vulnerability type, severity, and impact."
+        ),
+        "hard": (
+            "  Analyzing /static/js/app.bundle.js (847KB, minified)\n"
+            "\n"
+            "  High-entropy strings detected:\n"
+            '    Line 1847: const STRIPE_KEY = "sk_live_51HG3jK2eZvKYlo2C0EXAMPLE_REDACTED";\n'
+            '    Line 2103: const API_INTERNAL = "http://10.0.2.20:8080/api/v1";\n'
+            "    Line 2210: // TODO: remove hardcoded keys before production\n"
+            '    Line 3891: const JWT_SECRET = "shopfast-jwt-secret-2024";\n'
+            "\n"
+            '  Shannon entropy analysis:\n'
+            '    "sk_live_51HG3jK..." — entropy: 5.2 bits/char\n'
+            '    "shopfast-jwt-secret..." — entropy: 3.8 bits/char\n'
+            "  Total: 4 suspicious strings across 1 file"
+        ),
+    },
+    "VULN-M004": {
+        "medium": (
+            "  [!] Service accessible without authentication\n"
+            "  GET / → HTTP 200 (Jenkins 2.235.1 dashboard rendered)\n"
+            "  GET /script → HTTP 200 (Groovy script console accessible)\n"
+            "  GET /manage → HTTP 200 (management interface rendered)\n"
+            "  No authentication challenge (WWW-Authenticate header absent).\n"
+            "  CI/CD system with script execution capability is fully open.\n"
+            "  Assess the vulnerability type, severity, and impact."
+        ),
+        "hard": (
+            "  GET / → HTTP 200\n"
+            "  Headers: X-Jenkins: 2.235.1, X-Hudson: 1.395\n"
+            '  Body (excerpt): <div id="main-panel">...Welcome to Jenkins!...2 jobs configured...\n'
+            "\n"
+            "  GET /script → HTTP 200\n"
+            '  Body (excerpt): <textarea name="script" id="script">...</textarea>\n'
+            "  Note: No redirect to /login, no 401/403 status\n"
+            "\n"
+            "  GET /manage → HTTP 200\n"
+            '  Body (excerpt): ...Manage Jenkins...Configure System...Manage Plugins...\n'
+            "\n"
+            "  GET /credentials → HTTP 200\n"
+            '  Body (excerpt): ...Stored credentials: 3 items...\n'
+            "\n"
+            "  Cookie: no JSESSIONID set (no session management active)"
+        ),
+    },
+    "VULN-M005": {
+        "medium": (
+            "  [!] Database accepts weak credentials\n"
+            "  PostgreSQL (port 5432):\n"
+            "    dbuser:dbpass123 → authentication successful\n"
+            "    Connected — 23 tables visible, no connection rate limiting detected\n"
+            "  Assess the vulnerability type, severity, and impact."
+        ),
+        "hard": (
+            "  PostgreSQL (port 5432):\n"
+            "    postgres:postgres → FATAL: password authentication failed\n"
+            "    admin:admin → FATAL: password authentication failed\n"
+            "    dbuser:dbuser → FATAL: password authentication failed\n"
+            "    dbuser:dbpass123 → connection established\n"
+            "\n"
+            '    SELECT current_user; → "dbuser"\n'
+            "    \\dt → 23 tables (users, orders, payments, sessions, ...)\n"
+            "    SELECT count(*) FROM users; → 12,847\n"
+            "\n"
+            "  Brute force timing: 50 attempts in 3.1s, no lockout or delay increase"
+        ),
+    },
+    "VULN-M006": {
+        "medium": (
+            "  [!] Outdated TLS configuration detected\n"
+            "  Server accepts TLSv1.0 and TLSv1.1 (deprecated protocols)\n"
+            "  Weak cipher suite accepted: TLS_RSA_WITH_3DES_EDE_CBC_SHA\n"
+            "  HSTS header not present in responses.\n"
+            "  Assess the vulnerability type, severity, and impact."
+        ),
+        "hard": (
+            "  TLS handshake analysis:\n"
+            "    Protocol: TLSv1.0 → ServerHello (accepted)\n"
+            "    Protocol: TLSv1.1 → ServerHello (accepted)\n"
+            "    Protocol: TLSv1.2 → ServerHello (accepted)\n"
+            "    Protocol: TLSv1.3 → handshake failure\n"
+            "\n"
+            "  Cipher suites (port 443):\n"
+            "    TLS_RSA_WITH_3DES_EDE_CBC_SHA → accepted\n"
+            "    TLS_RSA_WITH_AES_128_CBC_SHA → accepted\n"
+            "    TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 → accepted\n"
+            "\n"
+            "  Certificate: CN=shopfast.com, expires 2025-12-01, issuer=Let's Encrypt\n"
+            "  HSTS: not present\n"
+            "  OCSP stapling: not enabled"
+        ),
+    },
+    # ---- HARD: Enterprise SOC2 Pre-Audit ----
+    "VULN-H001": {
+        "hard": (
+            "  Parameter: message\n"
+            "  Payload: <img src=x onerror=alert(1)>\n"
+            "  POST /support/ticket → HTTP 200\n"
+            "\n"
+            "  Verification: GET /support/tickets?latest=true\n"
+            '  Response body excerpt:\n'
+            '    <div class="ticket-body"><img src=x onerror=alert(1)></div>\n'
+            "  Note: Payload persisted in database and rendered without encoding\n"
+            "  Note: Response headers — no Content-Security-Policy, X-XSS-Protection: 0"
+        ),
+    },
+    "VULN-H002": {
+        "hard": (
+            "  Authenticated as user_id=1001 (Bearer token: eyJhbG...)\n"
+            '  GET /api/v2/accounts/1001 → HTTP 200 {"account_id":1001,"name":"Test User","balance":"$12,450.00"}\n'
+            '  GET /api/v2/accounts/1002 → HTTP 200 {"account_id":1002,"name":"Jane Smith","balance":"$87,230.50","ssn_last4":"4829"}\n'
+            '  GET /api/v2/accounts/1003 → HTTP 200 {"account_id":1003,"name":"Bob Jones","balance":"$3,100.00","ssn_last4":"7712"}\n'
+            "  GET /api/v2/accounts/9999 → HTTP 404\n"
+            "\n"
+            "  Note: Sequential integer IDs. Same Bearer token used for all requests.\n"
+            "  Note: PII (SSN, balance) exposed in cross-account responses."
+        ),
+    },
+    "VULN-H003": {
+        "hard": (
+            "  Concurrency testing on /api/v2/transfer:\n"
+            "  Initial balance (account 1001): $1,000.00\n"
+            "\n"
+            "  Sent 5 concurrent POST requests:\n"
+            '    {"from_account":"1001","to_account":"1002","amount":800}\n'
+            "\n"
+            '  Request 1: HTTP 200 {"status":"success","new_balance":"$200.00"}\n'
+            '  Request 2: HTTP 200 {"status":"success","new_balance":"$200.00"}\n'
+            '  Request 3: HTTP 200 {"status":"success","new_balance":"$200.00"}\n'
+            '  Request 4: HTTP 402 {"error":"insufficient_funds"}\n'
+            '  Request 5: HTTP 402 {"error":"insufficient_funds"}\n'
+            "\n"
+            "  Final balance (account 1001): -$1,400.00\n"
+            "  Total transferred: $2,400.00 from initial $1,000.00\n"
+            "  Note: Balance check and deduction are not atomic operations"
+        ),
+    },
+    "VULN-H004": {
+        "hard": (
+            "  Parameter: template\n"
+            "\n"
+            "  Payload: {{7*7}}\n"
+            '  Response body (excerpt): ...Report generated: 49...\n'
+            "\n"
+            "  Payload: {{config}}\n"
+            "  Response body (excerpt): ...Report generated: <Config {'ENV': 'production', "
+            "'DEBUG': False, 'SECRET_KEY': 'finsecure-prod-key-2024'}>...\n"
+            "\n"
+            '  Payload: {{"test".__class__.__mro__}}\n'
+            "  Response body (excerpt): ...(<class 'str'>, <class 'object'>)...\n"
+            "\n"
+            "  Server headers: X-Powered-By: Flask/Jinja2"
+        ),
+    },
+    "VULN-H005": {
+        "hard": (
+            "  File upload testing:\n"
+            '  test.txt (text/plain, 12B) → HTTP 200 {"path":"/uploads/test.txt"}\n'
+            '  test.php (application/x-php, 45B) → HTTP 200 {"path":"/uploads/test.php"}\n'
+            '  test.jsp (text/plain, 38B) → HTTP 200 {"path":"/uploads/test.jsp"}\n'
+            '  test.exe (application/octet-stream, 1KB) → HTTP 200 {"path":"/uploads/test.exe"}\n'
+            "\n"
+            "  Verification: GET /uploads/test.php\n"
+            "  Response: HTTP 200, Content-Type: text/html\n"
+            "  Body: PHP output rendered (file was executed server-side, not served as static)"
+        ),
+    },
+    "VULN-H006": {
+        "hard": (
+            "  FTP (port 21):\n"
+            "    USER anonymous → 331 Please specify the password\n"
+            "    PASS anonymous → 230 Login successful\n"
+            "    LIST → drwxr-xr-x  /reports/\n"
+            "           drwxr-xr-x  /employee_data/\n"
+            "           drwxr-xr-x  /financial/\n"
+            "    SIZE /financial/Q4_2024_revenue.xlsx → 2,371,584 bytes\n"
+            "\n"
+            "  SMB (port 445):\n"
+            "    Guest session → \\\\10.0.3.40\\Documents accessible (read/write)\n"
+            "    Files: employee_roster.csv, salary_data.xlsx, board_minutes_2024.pdf\n"
+            "    Share permissions: Everyone:FullControl"
+        ),
+    },
+    "VULN-H007": {
+        "hard": (
+            "  FTP (port 21):\n"
+            "    AUTH TLS → 502 Command not implemented\n"
+            "    FEAT → no TLS/SSL features advertised\n"
+            "\n"
+            "  Captured transfer: RETR /financial/Q4_2024_revenue.xlsx\n"
+            "  Transfer protocol: plaintext FTP (no encryption layer)\n"
+            "  Wireshark excerpt: PASS command visible in cleartext in packet capture\n"
+            "\n"
+            "  SFTP (port 22): connection refused\n"
+            "  FTPS: not available"
+        ),
+    },
+    "VULN-H008": {
+        "hard": (
+            "  SMTP (port 25, Postfix 3.4.13):\n"
+            "    EHLO test.local → 250 mail.finsecure.com\n"
+            "    MAIL FROM:<attacker@external.com> → 250 2.1.0 Ok\n"
+            "    RCPT TO:<ceo@finsecure.com> → 250 2.1.5 Ok\n"
+            "    DATA → 354 End data with <CR><LF>.<CR><LF>\n"
+            "    (message accepted from unauthenticated external sender)\n"
+            "\n"
+            "  DNS records for finsecure.com:\n"
+            "    SPF: no record found\n"
+            "    DKIM: no record found\n"
+            "    DMARC: no record found\n"
+            "    MX: mail.finsecure.com → 10.0.3.50"
+        ),
+    },
+    "VULN-H009": {
+        "hard": (
+            "  Port 8080: Apache Tomcat/9.0.31\n"
+            "\n"
+            "  NVD database results:\n"
+            "    CVE-2020-1938 — Base Score: 9.8 — AJP connector file read/inclusion\n"
+            "    CVE-2020-9484 — Base Score: 7.0 — session persistence deserialization\n"
+            "    CVE-2020-11996 — Base Score: 7.5 — HTTP/2 request handling DoS\n"
+            "\n"
+            "  Port 8009/tcp: open (AJP13 protocol response received)\n"
+            "  Note: AJP connector externally accessible from scan origin"
+        ),
+    },
+    "VULN-H010": {
+        "hard": (
+            "  POST /login — rate limit testing:\n"
+            "  Batch 1: 100 requests with invalid creds in 2.3s\n"
+            "    Result: 100x HTTP 401 (all processed, none blocked)\n"
+            "    Avg response time: 23ms (consistent, no throttling)\n"
+            "\n"
+            "  Batch 2: 1000 requests in 18.7s\n"
+            "    Result: 1000x HTTP 401\n"
+            "    No CAPTCHA challenge\n"
+            "    No account lockout triggered\n"
+            "    No IP-based blocking detected\n"
+            "\n"
+            "  Response headers: no X-RateLimit-* headers present"
+        ),
+    },
+}
+def _labeled_output(v: Dict) -> str:
+    """Easy mode: full labeled finding with CWE, CVSS, remediation."""
+    tag = {
+        "Critical": "CRITICAL", "High": "ALERT",
+        "Medium": "WARNING", "Low": "INFO",
+    }.get(v["severity"], "FINDING")
+    return (
+        f"  [{tag}] {v['type']} DETECTED\n"
+        f"  CWE: {v['cwe']}\n"
+        f"  OWASP: {v['owasp']}\n"
+        f"  Evidence: {v['evidence']}\n"
+        f"  Suggested CVSS: {v['cvss']} ({v['severity']})\n"
+        f"  Remediation: {v['remediation']}"
+    )
+def _format_detection(v: Dict, scenario: Dict, header: str) -> str:
+    """Pick output format based on scenario difficulty.
+
+    The tier is read from scenario["id"] (default "easy"); `header` is currently unused.
+    """
+    difficulty = scenario.get("id", "easy")
+    if difficulty == "easy":
+        return _labeled_output(v)
+    raw = _RAW_OUTPUTS.get(v["id"], {}).get(difficulty)
+    if raw:
+        return raw
+    # Fallback for medium vulns without explicit raw output
+    if difficulty == "medium":
+        return (
+            f"  [!] Potential issue detected\n"
+            f"  Observation: {v['evidence']}\n"
+            f"  Assess the vulnerability type, severity, and impact."
+        )
+    # Fallback for hard vulns without explicit raw output
+    return (
+        f"  Observation: {v['evidence']}\n"
+        f"  No further classification available — manual analysis required."
+    )
 def execute_tool(
     tool_name: str,
     arguments: Dict[str, Any],
     vulns = [v for v in scenario.get("vulnerabilities", [])
              if v["host"] == host and "vulnerability_scan" in v.get("discoverable_by", [])]
     vulns = _filter_by_chain(vulns, discovered_vulns)
+    difficulty = scenario.get("id", "easy")
     lines = [f"Vulnerability scan for {host}:", ""]
     reported = set()
         found_any = False
         for v in vulns:
             if v["id"] not in reported:
+                if difficulty == "easy":
+                    lines.append(f"    [!] VULNERABLE: {v['type']} (CVSS {v['cvss']}) — {v['cwe']}")
+                    lines.append(f"        Confidence: High | Source: NVD/OWASP")
+                    lines.append(f"        Remediation: {v['remediation']}")
+                else:
+                    raw = _RAW_OUTPUTS.get(v["id"], {}).get(difficulty)
+                    if raw:
+                        lines.append(raw)
+                    else:
+                        # Generic CVE-style output without explicit classification
+                        lines.append(f"    NVD match for {p['version']}")
+                        lines.append(f"    See output details for assessment.")
                 reported.add(v["id"])
                 found_any = True
         if not found_any:
     if vulns:
         lines = [f"Injection testing on {host}{endpoint}:", ""]
         for v in vulns:
+            lines.append(_format_detection(v, scenario, f"Injection testing on {host}{endpoint}"))
+            lines.append("")
         return "\n".join(lines), [], {}, 0.08
     else:
         return (f"Injection testing on {host}{endpoint}:\n"
     if vulns:
         lines = [f"XSS testing on {host}{endpoint}:", ""]
         for v in vulns:
+            lines.append(_format_detection(v, scenario, f"XSS testing on {host}{endpoint}"))
+            lines.append("")
         return "\n".join(lines), [], {}, 0.08
     else:
         return (f"XSS testing on {host}{endpoint}:\n"
     if vulns:
         lines = [f"Auth & access control testing on {host}" + (endpoint or "") + ":", ""]
         for v in vulns:
+            lines.append(_format_detection(v, scenario, f"Auth testing on {host}"))
+            lines.append("")
         return "\n".join(lines), [], {}, 0.08
     else:
         target = f"{host}{endpoint}" if endpoint else host
     if vulns:
         lines = [f"Configuration audit for {host}:", ""]
         for v in vulns:
+            lines.append(_format_detection(v, scenario, f"Configuration audit for {host}"))
+            lines.append("")
         return "\n".join(lines), [], {}, 0.08
     else:
         return (f"Configuration audit for {host}:\n"
     if vulns:
         lines = [f"Cryptographic analysis for {host}:", ""]
         for v in vulns:
+            lines.append(_format_detection(v, scenario, f"Cryptographic analysis for {host}"))
+            lines.append("")
         return "\n".join(lines), [], {}, 0.06
     else:
         return (f"Cryptographic analysis for {host}:\n"
     if vulns:
         lines = [f"Secret scanning on {host}" + (endpoint or "") + ":", ""]
         for v in vulns:
+            lines.append(_format_detection(v, scenario, f"Secret scanning on {host}"))
+            lines.append("")
         return "\n".join(lines), [], {}, 0.08
     else:
         target = f"{host}{endpoint}" if endpoint else host
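
The comment above `_RAW_OUTPUTS` says the medium and hard templates should carry no CWE, CVSS, or remediation. A small check over the template dictionary could guard that property as templates are added. This is a sketch, not part of the commit: `find_label_leaks`, the `LABEL_TOKENS` list, and the `VULN-X999` entry are hypothetical, and the token list is an assumption about what counts as a label.

```python
from typing import Dict, List, Tuple

# Assumed set of tokens that only the easy tier should emit.
LABEL_TOKENS = ("CWE-", "CVSS", "Remediation:", "OWASP:")


def find_label_leaks(
    raw_outputs: Dict[str, Dict[str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (vuln_id, tier, token) for every label token found in a template."""
    leaks = []
    for vuln_id, tiers in raw_outputs.items():
        for tier, text in tiers.items():
            for token in LABEL_TOKENS:
                if token in text:
                    leaks.append((vuln_id, tier, token))
    return leaks


# Same shape as _RAW_OUTPUTS: vuln ID -> tier -> template text.
sample_outputs = {
    "VULN-H010": {"hard": "  Response headers: no X-RateLimit-* headers present"},
    # Made-up, deliberately leaky entry to show what the check reports.
    "VULN-X999": {"hard": "  [CRITICAL] SQL Injection DETECTED\n  CWE: CWE-89"},
}

print(find_label_leaks(sample_outputs))
```

A substring check like this catches explicit labels only. It does not notice a template that names the vulnerability class in plain words, so it complements manual review of the templates without replacing it.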