Humanlearning committed on
Commit 00095ba · verified · 1 Parent(s): 5809cd1

Upload folder using huggingface_hub
Dockerfile ADDED
@@ -0,0 +1,81 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# Multi-stage build using openenv-base
# This Dockerfile is flexible and works for both:
#   - In-repo environments (with local OpenEnv sources)
#   - Standalone environments (with openenv from PyPI/Git)
# The build script (openenv build) handles context detection and sets appropriate build args.

ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
FROM ${BASE_IMAGE} AS builder

WORKDIR /app

# Ensure git is available (required for installing dependencies from VCS)
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*

# Build argument to control whether we're building standalone or in-repo
ARG BUILD_MODE=in-repo
ARG ENV_NAME=Cyber_analyst

# Copy environment code (always at the root of the build context)
COPY . /app/env

# For in-repo builds, openenv is already vendored in the build context.
# For standalone builds, openenv is installed via pyproject.toml.
WORKDIR /app/env

# Ensure uv is available (for local builds where the base image lacks it)
RUN if ! command -v uv >/dev/null 2>&1; then \
        curl -LsSf https://astral.sh/uv/install.sh | sh && \
        mv /root/.local/bin/uv /usr/local/bin/uv && \
        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
    fi

# Install dependencies using uv sync.
# If uv.lock exists, use it; otherwise resolve on the fly.
RUN --mount=type=cache,target=/root/.cache/uv \
    if [ -f uv.lock ]; then \
        uv sync --frozen --no-install-project --no-editable; \
    else \
        uv sync --no-install-project --no-editable; \
    fi

RUN --mount=type=cache,target=/root/.cache/uv \
    if [ -f uv.lock ]; then \
        uv sync --frozen --no-editable; \
    else \
        uv sync --no-editable; \
    fi

# Final runtime stage
FROM ${BASE_IMAGE}

WORKDIR /app

# Copy the virtual environment from the builder
COPY --from=builder /app/env/.venv /app/.venv

# Copy the environment code
COPY --from=builder /app/env /app/env

# Set PATH to use the virtual environment
ENV PATH="/app/.venv/bin:$PATH"

# Set PYTHONPATH so imports resolve correctly
ENV PYTHONPATH="/app/env:$PYTHONPATH"

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run the FastAPI server.
# The module path is constructed to work with the /app/env structure.
ENV ENABLE_WEB_INTERFACE=true
CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
README.md CHANGED
@@ -1,10 +1,158 @@
  ---
- title: Cyber Analyst
- emoji: 📈
- colorFrom: blue
- colorTo: pink
  sdk: docker
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
---
title: Cyber Analyst Environment Server
emoji: 🎯
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Cyber Analyst Environment

Cyber Analyst is an OpenEnv implementation of the "SecOps Evidence Gym" design from `docs/deep-research-report.md`. It benchmarks a bounded, safe security-triage workflow: investigate synthetic artifacts, cite evidence IDs, validate candidate findings with deterministic verifiers, and submit a remediation report.

The environment contains no live targets, no real secrets, no exploit workflow, no shell, and no outbound investigation tools. All evidence is static synthetic lab data.

## Tasks

The manifest ships three graded tasks:

| Task id | Difficulty | Goal |
| --- | --- | --- |
| `secret_exposure_easy` | easy | Find a synthetic API-key-like secret in a repo snapshot and propose removal plus rotation. |
| `missing_security_headers_medium` | medium | Detect missing HSTS/CSP headers in a synthetic gateway header snapshot. |
| `authz_boundary_hard` | hard | Detect an admin-route role-policy mismatch without exploitation. |

## Action Contract

Use one bounded tool call per `step`:

```python
CyberAnalystAction(
    tool_name="search_repo",
    args={"query": "api key"},
)
```

Approved tools:

- `list_assets()`
- `get_log_events(service_id, query)`
- `check_security_headers(service_id)`
- `search_repo(query)`
- `scan_dependencies()`
- `create_finding(finding_type, evidence_ids, severity_guess, remediation)`
- `validate_finding(finding_id)`
- `submit_report(report_json)`

## Observation Contract

Each observation includes:

- `alert`: the task prompt
- `tool_catalog`: the approved tool list
- `tool_result`: the latest tool result
- `evidence_ids`: discovered evidence IDs
- `candidate_findings`: created findings
- `verified_findings`: verifier-confirmed findings
- `score_breakdown`: a deterministic scoring explanation
- `step_budget_remaining`, `error`, `done`, and `reward`

Rewards and final scores are clamped to `0.01..0.99` for validator compatibility.
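For reference, such a clamp can be a one-line helper; a minimal sketch (a standalone function for illustration, not necessarily the repository's actual implementation):

```python
def clamp_score(score: float) -> float:
    """Clamp a raw score into the validator-friendly range [0.01, 0.99]."""
    return max(0.01, min(0.99, score))
```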

`submit_report` also returns `trajectory_jsonl`, a JSONL export of the episode events up to report submission. This is intended for offline inspection and future training-data extraction.
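Because the export is plain JSONL, it can be inspected with the standard library alone; the sample string and its field names here are hypothetical:

```python
import json

# Hypothetical two-event trajectory export; real events carry more fields.
trajectory_jsonl = '{"step": 1, "tool": "search_repo"}\n{"step": 2, "tool": "create_finding"}'

# Each line is one JSON event; parse line by line.
events = [json.loads(line) for line in trajectory_jsonl.splitlines() if line.strip()]
tools_used = [event["tool"] for event in events]
```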

## Local Run

From this directory:

```bash
uv run server
```

Then connect with the client:

```python
from Cyber_analyst import CyberAnalystAction, CyberAnalystEnv

with CyberAnalystEnv(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task_id="secret_exposure_easy", seed=7)
    result = env.step(CyberAnalystAction(tool_name="search_repo", args={"query": "api key"}))
    print(result.observation.tool_result)
```

## Baseline Inference

`inference.py` runs a deterministic scripted baseline and prints strict, parser-friendly logs:

```text
[START] task=<task_id> env=Cyber_analyst model=<model_name>
[STEP] step=<n> action=<tool_name> reward=<0.00> done=<true|false> error=<msg|null>
[END] task=<task_id> success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...>
```
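Since the fields follow a stable `key=value` layout, the log lines are easy to consume downstream; a sketch of one way to parse a `[STEP]` line (the sample line is illustrative):

```python
import re

# Hypothetical [STEP] line in the strict key=value format.
line = "[STEP] step=3 action=search_repo reward=0.05 done=false error=null"

# Collect the key=value pairs that follow the bracketed tag.
fields = dict(re.findall(r"(\w+)=(\S+)", line))
```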

LLM calls are not enabled by default. The script already includes OpenAI SDK configuration compatible with Hugging Face Inference Providers, so model-backed report drafting can be added later (the `set` syntax below is Windows `cmd`; use `export` in a POSIX shell):

```bash
set ENV_URL=http://localhost:8000
set API_BASE_URL=https://router.huggingface.co/v1
set MODEL_NAME=openai/gpt-oss-120b:novita
set HF_TOKEN=<your-hugging-face-token>
python inference.py
```

## Validation

Useful local checks (the last command uses the Windows venv layout):

```bash
python -m py_compile server/Cyber_analyst_environment.py inference.py
python -m pytest tests
.\.venv\Scripts\openenv.exe validate . --json
```

## Docker

Build the environment image from this directory:

```bash
docker build -t cyber-analyst-env:latest -f server/Dockerfile .
```

Run:

```bash
docker run -p 8000:8000 cyber-analyst-env:latest
```

Health check:

```bash
curl http://localhost:8000/health
```

## Deployment

Deploy to Hugging Face Spaces with OpenEnv:

```bash
openenv push --repo-id <your-hf-username>/Cyber_analyst
```

The deployed Space exposes `/health`, `/docs`, `/ws`, and the optional `/web` interface when web-UI support is enabled by the OpenEnv runtime.

## Adding Scenarios

Add new safe scenarios in `server/tasks.py` by extending `SCENARIOS` with:

- a stable `task_id`
- synthetic `assets`, `repo`, `logs`, `headers`, and `dependencies` entries
- `ground_truth_id`, `finding_type`, `required_evidence`, `impact_keywords`, and `remediation_keywords`

Then add a grader adapter in `server/graders.py` and a matching `tasks` entry in `openenv.yaml`. Keep all artifacts synthetic, keep correctness deterministic, and avoid adding real targets or arbitrary execution tools.
__init__.py ADDED
@@ -0,0 +1,17 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""Cyber Analyst Environment."""

from .client import CyberAnalystEnv
from .models import CyberAnalystAction, CyberAnalystObservation, CyberAnalystState

__all__ = [
    "CyberAnalystAction",
    "CyberAnalystObservation",
    "CyberAnalystState",
    "CyberAnalystEnv",
]
client.py ADDED
@@ -0,0 +1,39 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""Cyber Analyst Environment client."""

from typing import Any

from openenv.core import EnvClient
from openenv.core.client_types import StepResult

from .models import CyberAnalystAction, CyberAnalystObservation, CyberAnalystState


class CyberAnalystEnv(
    EnvClient[CyberAnalystAction, CyberAnalystObservation, CyberAnalystState]
):
    """WebSocket client for the Cyber Analyst OpenEnv environment."""

    def _step_payload(self, action: CyberAnalystAction) -> dict[str, Any]:
        return action.model_dump(exclude_none=True)

    def _parse_result(
        self, payload: dict[str, Any]
    ) -> StepResult[CyberAnalystObservation]:
        obs_data = dict(payload.get("observation", {}))
        obs_data["done"] = payload.get("done", False)
        obs_data["reward"] = payload.get("reward")
        observation = CyberAnalystObservation.model_validate(obs_data)
        return StepResult(
            observation=observation,
            reward=payload.get("reward"),
            done=payload.get("done", False),
        )

    def _parse_state(self, payload: dict[str, Any]) -> CyberAnalystState:
        return CyberAnalystState.model_validate(payload)
docs/deep-research-report.md ADDED
@@ -0,0 +1,531 @@
# OpenEnv Hackathon Execution Pipeline for a Safe Cybersecurity Analyst Environment

## Executive decision and scope selection

**SECTION 1 — Executive Decision**

[SOURCED] The hackathon’s *validator + judging* constraints strongly favour environments that: (a) simulate a real-world task (not “games/toys”), (b) are fully OpenEnv-compliant, (c) ship with **≥3 tasks with graders**, (d) produce **scores/rewards in the 0–1 range**, (e) include a reproducible `inference.py` that uses the **OpenAI client** (for any LLM calls) and prints strict `[START]/[STEP]/[END]` logs, and (f) run within a ~**20 minute** budget on ~**2 vCPU / 8 GB** infra.

[INFERENCE] Under these realities, your cybersecurity-analyst direction is *not* automatically the best, but it *can* become a high-probability-to-win choice if, and only if, you narrow to a deterministic, bounded, “investigate → cite evidence → verify → remediate” loop where (i) tools are tightly sandboxed, (ii) graders are deterministic, and (iii) the action space is small enough to be learnable and demo-able under the runtime cap.

[PROPOSAL] **Decision:** keep the cybersecurity direction, but **narrow aggressively** to a V1 environment that benchmarks **disciplined security triage + evidence-grounded reporting**, not pentesting/exploitation. The V1 I recommend building is:

[PROPOSAL] **“SecOps Evidence Gym”** — a safe, isolated OpenEnv environment where an agent investigates a *synthetic* microservice “organisation” via a **bounded tool API**, collects **evidence IDs**, validates candidate findings through **deterministic verifiers**, and submits a structured remediation report.

[SOURCED] This matches strong “winner DNA” seen in `kube-sre-gym` (realistic professional workflow + verification + narrative clarity) while remaining implementable in a hackathon budget.

[PROPOSAL] **What to cut entirely in V1 (non-negotiable):**
- “Live target” behaviour; no external network targets; no arbitrary HTTP to the internet. 🔒
- Any exploit payload recipes, exploit chains, privilege-escalation playbooks, or “how to hack X”. 🔒
- Arbitrary shell access (`bash`, `kubectl`, `nmap`, etc.) inside the environment. (Action-space explosion + safety risk.)
- LLM-only grading/judging for correctness. (Reward hacking + non-determinism.)

[SOURCED] **What to keep (but narrow):** tool-using investigation, multi-step interaction, and deterministic verification — these are consistent with what OpenEnv is designed to support (typed `reset/step/state`, isolated server, type-safe schemas).

**SECTION 3 — Candidate Scope Comparison**

[SOURCED] The scoring below is anchored on hackathon validator requirements (3+ graded tasks, 0–1 scoring, strict inference logging, runtime limits) plus OpenEnv’s scaffolding/CLI/deployment model.

[PROPOSAL] Weighted criteria (sum = 1.00): judging fit 0.14, OpenEnv fit 0.12, grader determinism 0.14, implementation risk 0.12, runtime feasibility 0.10, demoability 0.10, real-world usefulness 0.10, novelty 0.08, training usefulness 0.06, shipping-on-time likelihood 0.04.

| Candidate scope | Judging fit | OpenEnv fit | Determinism | Impl risk (lower=better) | Runtime | Demo | Real-world use | Novelty | Training use | Ship-on-time | Weighted total |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| **A. Your original direction:** “disciplined cyber analyst investigating a sandbox” (broad) | 8 | 7 | 6 | 4 | 6 | 8 | 8 | 8 | 8 | 4 | **6.7** |
| **B. Narrow cyber variant (recommended):** evidence-first triage lab with bounded tools + deterministic verifiers + structured report | 9 | 8 | 9 | 7 | 9 | 9 | 8 | 7 | 8 | 8 | **8.4** |
| **C. Adjacent: SRE incident triage (single-turn, deterministic logs → RCA)** | 9 | 8 | 10 | 9 | 10 | 8 | 8 | 5 | 6 | 9 | **8.3** |
| **D. Adjacent: prompt-injection “WAF” benchmark** | 7 | 8 | 8 | 7 | 9 | 9 | 7 | 7 | 6 | 7 | **7.6** |

[INFERENCE] Candidate C (SRE triage) is extremely validator-safe (many examples already pass deep validation), but it is likely more saturated and less “new” in judging. Candidate B keeps your cybersecurity theme while retaining the determinism and boundedness that the validator and time budget demand, so it is the best balance of **winning + real-world usefulness**.

**SECTION 4 — Final V1 Problem Statement**

[PROPOSAL] **One-sentence version:** Build a safe OpenEnv environment that trains and benchmarks an agent to perform **evidence-grounded security triage** on a bounded synthetic system and produce a **remediation-oriented report** without hallucinating.

[PROPOSAL] **Short pitch (demo-ready):** “SecOps Evidence Gym” gives the model an alert and a constrained set of investigation tools. The model must gather evidence, validate findings with deterministic verifiers, and submit a structured report. Scores reward verified correctness, penalise hallucinated claims and wasted steps, and remain strictly within (0,1) to satisfy the hackathon validator. ✅🔒

[PROPOSAL] **Precise implementation version:** A FastAPI OpenEnv server exposing typed `reset()`, `step()`, `state()` and a manifest `openenv.yaml` with at least three tasks (easy/medium/hard), each associated with a grader. The environment implements multi-step tool calls inside `step()`, and uses deterministic verifiers plus a strict score clamp to `(0,1)` for validator compatibility.

## Winner patterns and judging fit extraction

**SECTION 2 — Winner Pattern Extraction**

[SOURCED] `kube-sre-gym` (OpenEnv Hackathon SF winner) demonstrates several patterns that appear strongly aligned with what judges value: a **realistic professional task**, an explicit **multi-step workflow** (triage → investigate → fix → verify), **multi-layer verification** (programmatic health checks + judge), a strong narrative that explains learning dynamics, and deployment on **Hugging Face Spaces**.

[SOURCED] Concretely, `kube-sre-gym` highlights: “real cluster/tool interaction”, verification layers to prevent false success, curriculum progression, and strong documentation that makes the environment’s value obvious quickly.

[SOURCED] A 2026 hackathon submission that explicitly claims Phase-2 deep-validation success (`Incident-Triage-Environment`) reveals particularly transferable “validator-winning” patterns:
- A manifest with `spec_version`, `runtime`, `app`, `port`, and a **`tasks:` list where each task has a `grader:` pointing to importable Python functions**.
- Deterministic graders that clamp outputs to a validator-friendly range.
- An `inference.py` that uses the OpenAI SDK and prints the strict stdout protocol with `[START]`, `[STEP]`, and `[END]` lines in a stable key=value format.

[SOURCED] The official OpenEnv repo reinforces what is transferable: type-safe action/observation/state contracts, containerised isolation, and standard Gym-like APIs. It also explicitly says OpenEnv is experimental and APIs can change, which increases the value of a minimal, validator-first build loop.

[INFERENCE] Given that hackathon evaluation combines programmatic checks with LLM scoring, you must optimise for **deterministic correctness** *and* a compelling narrative/demo.

[PROPOSAL] Transferable “winner patterns” you should copy (selectively):
- **Strong “professional workflow” framing** (SRE, security analyst, triage) with clear step boundaries.
- **Small, discoverable tool set** that mirrors real practice (logs, config, policy checks) while staying bounded.
- **Deterministic verification** (programmatic checks) as the source of truth for correctness.
- **Narrative traceability**: logs, episode IDs, and a short “watch the agent work” demo.
- **Deployment excellence**: clean Docker build, working `/health`, working `/web` UI if enabled, and reproducible inference.

[PROPOSAL] What *not* to copy blindly:
- The “real cluster” dependency (e.g., GKE) is a high operational burden and can fail under the hackathon’s limited infra budget.
- LLM-as-judge for correctness (too easy to reward-hack; non-deterministic). Use it, at most, for *format/style*, not correctness.

## Core V1 environment design and benchmark tasks

**SECTION 8 — Core Environment Design**

**V1 concept (aggressively narrow).**
[PROPOSAL] Your environment is a **synthetic organisation** with a small, fixed topology (three “services” + artefacts). The agent receives an alert. It can only interact via **approved tools** (implemented inside the simulator). It must (a) gather evidence IDs, (b) validate candidate findings, and (c) submit a report.

**Topology (V1).**
[PROPOSAL] Fixed components (no containers inside containers):
- `gateway` (public entry), `profile-service`, `admin-service`
- `repo_snapshot` (static code/config excerpts)
- `telemetry` (sanitised logs + “header snapshot” + “dependency manifest snapshot”)

**Reset logic.**
[PROPOSAL] `reset(task_id=..., seed=...)` selects a scenario variant and initialises:
- episode ID, step count
- scenario ground truth (one injected issue per episode in V1)
- tool budgets + “allowed scope” banner
- an evidence registry mapping `EVID-### → artefact snippet`

It then returns an initial observation containing the alert, the tool catalogue, and an empty “verified findings” list.
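This reset flow can be sketched in a few lines; the scenario table, field names, and observation shape below are illustrative assumptions, not the final implementation:

```python
import random

# Hypothetical scenario registry keyed by task_id.
SCENARIOS = {"secret_exposure_easy": {"alert": "Possible secret committed to repo"}}

def reset(task_id: str, seed: int) -> dict:
    """Initialise one episode deterministically from (task_id, seed)."""
    rng = random.Random(seed)  # seed-driven, reproducible variation
    scenario = SCENARIOS[task_id]
    return {
        "episode_id": f"{task_id}-{seed}",
        "step_count": 0,
        "alert": scenario["alert"],
        "tool_catalog": ["list_assets", "search_repo", "submit_report"],
        "verified_findings": [],
        "evidence_registry": {f"EVID-{rng.randint(100, 999)}": "artefact snippet"},
    }
```

Because all variation flows through `random.Random(seed)`, the same `(task_id, seed)` pair always reproduces the same episode.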

**Randomisation strategy.**
[PROPOSAL] Use seed-driven, deterministic randomisation:
- rename services/routes/IDs (`profile-service` might become `user-profile`),
- shuffle benign log lines around the key evidence,
- vary exact header sets / dependency versions within a small closed set,
- keep each scenario **fully reproducible from the seed**.
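The renaming-and-shuffling idea above can be sketched with the standard library; the candidate names and log lines are illustrative:

```python
import random

def vary_scenario(seed: int) -> dict:
    """Derive a reproducible scenario variant from the seed alone."""
    rng = random.Random(seed)
    # Rename within a small closed set.
    service_name = rng.choice(["profile-service", "user-profile", "profiles-api"])
    benign_logs = [f"GET /{service_name}/status 200" for _ in range(3)]
    logs = benign_logs + ["ERROR key=sk-test-LAB-ONLY leaked"]  # key evidence line
    rng.shuffle(logs)  # shuffle benign noise around the evidence
    return {"service": service_name, "logs": logs}
```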

[SOURCED] Benchmark generators (e.g., AMaze) exist specifically to create diverse but controlled environments for evaluating generalisation, supporting the idea of seeded procedural variation rather than a single static scenario.

**Safety boundaries.**
[PROPOSAL] The sandbox contains **no live targets**, no real secrets, and no outbound network. “Secrets” are synthetic strings with an explicit “DO NOT USE OUTSIDE LAB” marker. Tools return synthetic results only. 🔒

[SOURCED] NIST’s cyber range guidance emphasises cyber ranges as safe and legal environments for training and assessment; separate research also discusses that cyber ranges themselves have security risks that must be mitigated (e.g., leakage/misuse), reinforcing the need for strict isolation and artefact sanitisation.

**How state is exposed to the agent.**
[PROPOSAL] Expose only a concise state summary: current phase, step budget remaining, tools remaining, verified findings count, and recent evidence IDs. Keep full ground truth hidden.

**Tool/action design (bounded action space).**
[PROPOSAL] V1 tool list (keep it ≤8 tools):
1) `list_assets()` → returns asset IDs and route IDs
2) `get_log_events(service_id, query)` → returns evidence IDs
3) `check_security_headers(service_id)` → returns evidence IDs + pass/fail list
4) `search_repo(query)` → returns evidence IDs from code snippets
5) `scan_dependencies()` → returns evidence IDs from a lockfile excerpt
6) `create_finding(finding_type, evidence_ids, severity_guess, remediation)` → stores a candidate finding
7) `validate_finding(finding_id)` → deterministic verifier; returns `(verified, matching_gt_id)`
8) `submit_report(report_json)` → terminal action
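One way to keep this action space bounded is a plain allowlist dispatch inside `step()`; a minimal sketch with stub handlers (the handler bodies and return shapes are assumptions):

```python
def step(state: dict, tool_name: str, args: dict) -> dict:
    """Dispatch one bounded tool call; unknown tools never crash the server."""
    handlers = {
        "list_assets": lambda s, a: {"assets": ["gateway", "profile-service"]},
        "search_repo": lambda s, a: {"evidence_ids": ["EVID-101"]},
    }
    handler = handlers.get(tool_name)
    if handler is None:
        # Invalid action: return a valid observation with an error, not an exception.
        return {"tool_result": None, "error": f"unknown tool: {tool_name}", "done": False}
    return {"tool_result": handler(state, args), "error": None, "done": False}
```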

**Anti-loop logic.**
[PROPOSAL] Track action signatures `(tool_name, args_hash)` and:
- apply increasing penalties for repeats,
- hard-stop an episode if identical actions repeat ≥6 times, returning `done=True` with a low score,
- always return a valid observation (never a server crash) to preserve training rollouts.
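The signature tracking above can be sketched as a small helper; the penalty schedule (0.01 per repeat) is an illustrative assumption:

```python
from collections import Counter

class RepeatTracker:
    """Detect identical repeated actions and escalate penalties."""

    def __init__(self, hard_stop: int = 6):
        self.hard_stop = hard_stop
        self.counts: Counter = Counter()

    def observe(self, tool_name: str, args_hash: str) -> tuple[float, bool]:
        """Return (penalty, should_hard_stop) for this action signature."""
        signature = (tool_name, args_hash)
        self.counts[signature] += 1
        repeats = self.counts[signature] - 1
        penalty = 0.01 * repeats  # increasing penalty per identical repeat
        return penalty, self.counts[signature] >= self.hard_stop
```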

[SOURCED] OpenEnv’s environment-creation guidance strongly implies you should implement robust behaviour around `reset/step/state`, with typed contracts and predictable server behaviour.

**SECTION 9 — Tasks / Benchmarks**

[SOURCED] The hackathon requires **at least 3 tasks with graders** and explicitly checks the tasks registry.

[PROPOSAL] V1 ships exactly **3 flagship tasks**, difficulty-tiered, each with deterministic success criteria and intermediate milestones.

**Flagship tasks (easy/medium/hard).**
[PROPOSAL] Each task is a *family* with small seeded variants.

**Easy: Secret exposure in repo snapshot**
- Goal: identify a leaked synthetic API key in a config file excerpt; propose rotation/removal.
- Deterministic success: the report includes the correct finding type `secret_exposure`, includes ≥1 correct evidence ID, and the remediation mentions rotation + removal.
- Intermediate rewards: `search_repo()` surfaces the evidence ID; `create_finding()` with the correct type gets partial credit; `validate_finding()` confirms.
- False-positive check: claiming *additional* vulnerabilities not verified triggers a penalty.
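The easy task's success criteria can be graded deterministically; a sketch (the report field names and the 0.4/0.3/0.3 weights are assumptions, and the clamp follows the validator-safe range this report recommends):

```python
def grade_secret_exposure(report: dict, ground_truth: dict) -> float:
    """Deterministic grader: correct type, cited evidence, remediation keywords."""
    score = 0.0
    if report.get("finding_type") == "secret_exposure":
        score += 0.4
    cited = set(report.get("evidence_ids", []))
    if cited & set(ground_truth["evidence_ids"]):  # at least one correct evidence ID
        score += 0.3
    remediation = report.get("remediation", "").lower()
    if "rotat" in remediation and "remov" in remediation:
        score += 0.3
    return max(0.01, min(0.99, score))  # validator-safe clamp into (0, 1)
```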
149
+
150
+ **Medium: Missing security headers**
151
+ - Goal: detect missing/weak security headers in a service “header snapshot”; propose remediation.
152
+ - Deterministic success: correct missing header set identification (from a fixed list), plus remediation mapping (e.g., add HSTS, CSP) within the environment’s rubric.
153
+ - Intermediate rewards: correct tool usage (`check_security_headers()`), correct mapping to finding type, successful verifier validation.
154
+ - Generalisation: header ordering/extra benign headers vary by seed.
155
+
156
+ **Hard: Authorisation boundary misconfiguration**
157
+ - Goal: detect an access control policy bug in a route/role matrix (modelled safely, without exploitation).
158
+ - Deterministic success: evidence IDs must show the policy mismatch; report must describe impact and remediation (principle of least privilege + policy fix + regression test).
159
+ - Intermediate rewards: `list_assets()` + `get_log_events()` reveal the mismatch pattern; candidate finding validated.
160
+ - False-positive guardrail: generic “SQLi/RCE” claims penalised unless evidence supports (it won’t, by design).
161
+
162
+ **Stretch tasks (post-V1, not for hackathon critical path).**
163
+ [PROPOSAL] Dependency-risk identification (synthetic CVE mapping), error-handling info leak, prioritisation under strict budget, and multi-finding episodes (2 findings) — but only once the validator-safe V1 is shipped.
164
+
165
+ ## OpenEnv compliance blueprint and repo plan
166
+
167
+ **SECTION 6 — OpenEnv Compliance Blueprint**
168
+
169
+ [SOURCED] OpenEnv’s core contract is Gymnasium-like APIs (`reset()`, `step()`, `state()`), with type-safe models, packaged behind a FastAPI server and typically accessed via an EnvClient. citeturn18view0turn19search1
170
+
171
+ [SOURCED] For environment creators, OpenEnv explicitly supports `openenv init`, and documents a canonical structure: `models.py`, `client.py`, `server/app.py`, `server/<environment>.py`, plus `openenv.yaml` and packaging metadata. citeturn18view0turn18view1
172
+
173
+ [SOURCED] OpenEnv provides CLI commands including `openenv init` and `openenv push` for deploying to **Hugging Face Spaces**. citeturn18view0turn17view0
174
+
175
+ [SOURCED] The OpenEnv repo’s environment-building guide demonstrates typed models (Action/Observation/State) as Python dataclasses and a `create_fastapi_app(...)` helper to serve the environment. citeturn19search1
176
+
177
+ [SOURCED] The OpenEnv repo explicitly warns *not* to copy outdated manifest patterns; current examples use `spec_version`, `type`, `runtime`, `app`, `port`. citeturn19search2turn23view0
178
+
179
+ **Validator-sensitive details you must implement (non-negotiable).**
180
+ [PROPOSAL] Based on official requirements + observed validator behaviour:
181
+ - Provide `openenv.yaml` with `spec_version: 1`, `name`, `runtime: fastapi`, `app: server.app:app`, `port: <int>`, and a `tasks:` list with **≥3 tasks each having `id`, `description`, `grader`**. citeturn23view0turn19search2
182
+ - Ensure each task’s final score is **strictly within (0,1)** to avoid fail-fast validation errors. citeturn27search0turn26view0
183
+ - Implement an `inference.py` that prints `[START]/[STEP]/[END]` lines exactly and uses the OpenAI SDK for LLM calls (if any), reading `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`. citeturn3view6turn22view1turn22view2
184
+ - Provide a `/health` endpoint that returns 200 once ready (commonly used in examples and deployment docs). citeturn17view0turn20view0
185
+
186
+ **Sync vs async.**
187
+ [SOURCED] OpenEnv supports async-first clients with a `.sync()` wrapper for synchronous usage. For hackathon inference scripts, synchronous control flow is often simpler and widely used in examples. citeturn18view0turn22view4
188
+
189
+ **What not to copy from older examples.**
190
+ [SOURCED] Some course material shows a simplified `openenv.yaml` (`name/version/description`), but the repo’s skill guidance explicitly warns against outdated manifests; follow the current spec-style manifest used in validated examples. citeturn19search2turn19search11turn23view0
191
+
**SECTION 7 — Repo / File Tree Plan**

[SOURCED] OpenEnv’s scaffold and common community submissions converge on a predictable repository layout and file naming.

[PROPOSAL] Recommended repo structure (submission-ready):

```
secops_evidence_gym/
  openenv.yaml               # REQUIRED: spec_version, runtime, app, port, tasks+graders
  pyproject.toml             # REQUIRED: package metadata + deps
  README.md                  # REQUIRED: judging narrative + quickstart + safety boundaries
  inference.py               # REQUIRED: strict stdout logs + OpenAI client usage
  models.py                  # REQUIRED: typed Action/Observation/State dataclasses
  client.py                  # REQUIRED: EnvClient wrapper (sync + async)
  __init__.py                # REQUIRED: export Env + models for pip install

  server/
    app.py                   # REQUIRED: create_fastapi_app(...) wiring + /health
    environment.py           # REQUIRED: SecOpsEvidenceGymEnvironment(reset/step/state)
    graders.py               # REQUIRED: grade_easy/medium/hard + safe_reward clamp
    tasks.py                 # OPTIONAL (high-leverage): scenario registry + seed sampling
    safety.py                # OPTIONAL (high-leverage): tool allowlist + sanitisation helpers
    requirements.txt         # OPTIONAL (if Docker build uses it)
    Dockerfile               # REQUIRED (practically): HF Spaces docker build

  tests/
    test_api_contract.py     # smoke: reset/step/state doesn’t crash; reward range
    test_graders.py          # unit: deterministic scoring + strict (0,1) clamp
    test_seed_determinism.py # unit: same seed → same evidence IDs
```

[PROPOSAL] Mandatory for hackathon success: `openenv.yaml`, server app wiring, three tasks+graders, Docker build success, `inference.py` with strict logs, and a README that makes the environment’s value obvious in <60 seconds.

## Reward, grading, and anti-hallucination design

**SECTION 10 — Reward Design**

[SOURCED] OpenEnv leaves reward semantics to the environment; you are responsible for correctness scoring and determinism.

[SOURCED] Hackathon validation has shown strict “score must be between 0 and 1 (not 0.0 and not 1.0)” behaviour, and teams clamp rewards (e.g., 0.01–0.99).

[SOURCED] Empirical RL research in other domains (e.g., autonomous racing) shows reward design choices materially affect performance and generalisation, supporting the need for careful shaping rather than a single sparse terminal reward.

[PROPOSAL] **Core principle:** correctness is **verifier-gated**, not language-judged. You can optionally add *format/style* checks, but never allow style to dominate correctness reward.

### Reward structure (practical V1)

[PROPOSAL] Normalise the final *task score* into `(0,1)` and keep per-step rewards small enough that summed episode reward stays in `(0,1)` as well (or only final reward is used, depending on your environment semantics). Use a single “score” to satisfy the validator and expose detailed breakdowns in `observation.metadata`.

**Terminal (sparse) components** ✅
[PROPOSAL]
- `+0.60` if at least one ground-truth finding is verified and correctly described (type + impact).
- `+0.15` if the report includes **≥1 valid evidence ID** per finding and those IDs correspond to the right artefacts.
- `+0.15` if remediation is actionable (specific control, config, test).
- `-0.40` per hallucinated/unverified finding claimed in the report.
- `-0.20` if the agent fails to run `validate_finding()` before `submit_report()`.
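The terminal components above combine into one deterministic score; a minimal sketch (function name and boolean inputs are assumptions for illustration, and the 0.01/0.99 clamp matches the proposed validator-safe range):

```python
def terminal_score(
    verified_ok: bool,
    evidence_ok: bool,
    remediation_ok: bool,
    hallucinated: int,
    validated_before_submit: bool,
) -> float:
    """Combine the sparse terminal reward components into one clamped score."""
    score = 0.0
    if verified_ok:
        score += 0.60       # verified, correctly described finding
    if evidence_ok:
        score += 0.15       # valid evidence IDs per finding
    if remediation_ok:
        score += 0.15       # actionable remediation
    score -= 0.40 * hallucinated          # per hallucinated finding
    if not validated_before_submit:
        score -= 0.20       # skipped validate_finding() before submit_report()
    return max(0.01, min(0.99, score))    # validator-safe (0,1) clamp
```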

**Intermediate (dense) components** 🧭
[PROPOSAL]
- `+0.02` for discovering a *new* relevant evidence ID (first time only).
- `+0.03` for creating a well-formed candidate finding that references evidence IDs.
- `-0.01` per step (efficiency pressure).
- `-0.03` for repeating the same tool call (exact same args) beyond 2 times.

**False-positive penalties / anti-hallucination** 🧯
[PROPOSAL] A “hallucination” is operationally defined as: the report asserts a finding that is not in the environment’s `verified_findings` list. This is easy to compute deterministically and maps directly to your stated goal (“avoid hallucinating findings”).
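That operational definition is cheap to compute. A minimal sketch, assuming findings are dicts carrying the `finding_type` key from the proposed schema (the helper name and per-finding penalty default are this plan’s values):

```python
def hallucination_penalty(
    reported: list[dict], verified: list[dict], per_finding: float = 0.40
) -> float:
    """Penalty for reported findings absent from the environment's verified list.

    A finding counts as hallucinated when its finding_type does not appear
    among the verifier-confirmed findings; the check is fully deterministic.
    """
    verified_types = {f["finding_type"] for f in verified}
    unverified = [f for f in reported if f["finding_type"] not in verified_types]
    return per_finding * len(unverified)
```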

### Avoiding reward hacking

[PROPOSAL] Hardening rules:
- Cap rewards from verbosity: extra words do not add points.
- Make evidence IDs required for high scores (prevents purely rhetorical “security speak”).
- Penalise calling `validate_finding()` repeatedly without new evidence.
- Reject “kitchen sink” reporting by penalising extra unverified findings.
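The repeated-call penalty (from the dense components and the hardening rules above) needs exact-duplicate detection. A sketch, with class name, free-repeat quota, and penalty value all being this plan’s assumptions:

```python
import json
from collections import Counter


class RepeatTracker:
    """Track exact-duplicate tool calls and charge a penalty past a free quota."""

    def __init__(self, free_repeats: int = 2, penalty: float = 0.03) -> None:
        self.counts: Counter = Counter()
        self.free_repeats = free_repeats
        self.penalty = penalty

    def charge(self, tool_name: str, args: dict) -> float:
        # Canonicalise args so {"a": 1, "b": 2} and {"b": 2, "a": 1} match.
        key = (tool_name, json.dumps(args, sort_keys=True))
        self.counts[key] += 1
        return self.penalty if self.counts[key] > self.free_repeats else 0.0
```

The environment’s `step()` would subtract `charge(...)` from the step reward before clamping.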

### Binary vs shaped reward

[PROPOSAL] **Binary-only** (0/1) will be easy to implement but brittle for multi-step tool use; the agent gets no gradient for *how* to investigate efficiently.

[PROPOSAL] **Lightly shaped** (recommended) keeps correctness deterministic while providing enough signal to train investigation workflow (evidence collection, validation order, loop avoidance). This mirrors the broader lesson from reward engineering research: shaping and tuning can significantly alter learning outcomes.

### Deterministic judge vs hybrid judge

[PROPOSAL]
- **Strict deterministic judge (recommended V1):** all correctness via verifiers + string/structure checks.
- **Hybrid (stretch):** add a small LLM-based style score (e.g., clarity), heavily downweighted (≤0.05 of total) and never affecting pass/fail correctness.

## Baseline inference pipeline and strict stdout logging

**SECTION 11 — Baseline Inference Pipeline**

[SOURCED] Hackathon requirements include: a reproducible `inference.py`, the OpenAI client requirement for LLM calls (using provided env vars), and strict stdout logging.

[SOURCED] A concrete, hackathon-aligned stdout format has been used by validated submissions (example):
- `[START] task=<name> env=<benchmark> model=<model_name>`
- `[STEP] step=<n> action=<str> reward=<0.00> done=<true|false> error=<msg|null>`
- `[END] task=<name> success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...>`

[SOURCED] The same example inference uses the OpenAI SDK, reading `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN`.

### Responsibilities of `inference.py`

[PROPOSAL] `inference.py` should:
- read env vars: `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME`, `ENV_URL` (and optionally `TASK_NAME` override),
- connect to the env via `.sync()` client,
- run tasks in a fixed order (easy → medium → hard),
- execute a bounded number of steps per task,
- log exactly one `[START]...` per task, one `[END]...` per task, and a `[STEP]...` per environment step,
- always exit with code 0 (even on failures) and log errors in the `[STEP] error=` field to avoid hard crashes.

### Control flow (V1 baseline strategy)

[PROPOSAL] Use a **hybrid baseline** that is reliable under time constraints:
- scripted tool sequence per task (fast, deterministic),
- one LLM call (optional) to draft the final report from gathered evidence (so the demo shows “agentic reasoning”),
- temperature fixed to 0 for reproducibility (and lower variance).

[SOURCED] Deterministic inference settings like `TEMPERATURE=0.0` are used in competitive OpenEnv hackathon baselines.

### Minimum viable baseline (must ship)

[PROPOSAL] For each task:
1) `reset(task_id=<tier>)`
2) run 2–4 tool calls that are always relevant (e.g., `check_security_headers`, `search_repo`, etc.)
3) `create_finding(...)` using evidence IDs
4) `validate_finding(finding_id)`
5) `submit_report(report_json)`

### Stronger baseline (only if time permits)

[PROPOSAL] Add one planning LLM call that chooses among tools based on the alert type, but still keep a hard step limit, and always include verifier validation before reporting.

## Complete build, validation, deployment, and submission pipeline

**SECTION 5 — Complete End-to-End Pipeline**

[SOURCED] This pipeline is built to satisfy both OpenEnv conventions (init/push, typed models, FastAPI server) and hackathon validation constraints (tasks/graders, inference logging, runtime budgets).

### Phase goals, deliverables, verification (execution-ready)

[PROPOSAL] The table below is the “do-this-in-order” execution plan. It is intentionally validator-first.

| Phase | Goal | Deliverables | Files touched | Acceptance criteria | Main risks | How to verify |
|---|---|---|---|---|---|---|
| Scope lock | Freeze V1 to 3 tasks + bounded tools | 1-page spec + non-goals | README.md | No pentest/exploit scope; 3 tasks defined | Scope creep | Manual checklist |
| Scaffold | Generate OpenEnv skeleton | Working importable package | openenv.yaml, models.py, client.py, server/* | `python -c "import ..."` succeeds | Wrong template/paths | Local import smoke test |
| Environment core | Implement reset/step/state; tool router | Simulator runs end-to-end | server/environment.py | reset+step returns typed observation; no crashes | Action validation crashes | manual `curl` + python client |
| Tasks + graders | Implement 3 graders + strict (0,1) clamp | `grade_easy/medium/hard` | server/graders.py, openenv.yaml | tasks discoverable; scores strictly in (0,1) | Validator fail-fast | unit tests + manual checks |
| Baseline inference | Make inference reproducible + strict logs | inference.py | inference.py | prints correct `[START]/[STEP]/[END]` | log-parser failure | run script locally |
| Local validation | Run OpenEnv build & validate | passes `openenv validate` | Dockerfile, server/app.py | validate passes locally | port mismatch | `openenv validate --url ...` |
| Docker + HF | Deploy to Spaces | live endpoint | openenv push output | `/health` 200; reset+step works remotely | HF port/env mismatch | curl + python client |
| Submission | Final narrative + demo | polished README + screenshots | README.md | demo works in <2 min | unclear story | run “demo script” |

### Concrete build plan with commands

[SOURCED] OpenEnv supports `openenv init` and `openenv push` and documents this as the standard creator workflow.
[SOURCED] The OpenEnv course also provides a grounded dev loop: `uv sync`, `uv run server`, `curl /health`, and Docker build/run commands.

[PROPOSAL] Commands (copy/paste order):

1) **Scaffold**
```bash
pip install openenv-core
openenv init secops_evidence_gym
cd secops_evidence_gym
```
[SOURCED] `openenv init` is the documented way to scaffold a new environment.

2) **Local dev install + run**
```bash
uv sync
uv run server
curl http://localhost:8000/health
```
[SOURCED] `uv run server` and `/health` checks are part of the recommended iteration loop in OpenEnv course materials.

3) **Implement core files (edit)**
- `models.py`: define `Action/Observation/State` dataclasses
- `server/environment.py`: implement reset/step/state + tool routing
- `server/graders.py`: implement `grade_easy/grade_medium/grade_hard` + `safe_reward()`
- `openenv.yaml`: add `tasks:` with grader import paths

[SOURCED] OpenEnv’s environment-building guide explicitly directs you to define models and implement `reset/step/state`, then wire a FastAPI app.
[SOURCED] A validator-aligned `openenv.yaml` with `spec_version`, `runtime`, `app`, `port`, and `tasks` exists in deep-validation passing examples.
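The graders wired in step 3 can start from a small deterministic skeleton. A sketch only: the grader signature below (report dict plus verified-findings list) is this plan’s assumption, not the official OpenEnv grader interface, and the weights reuse the reward-design proposal:

```python
def grade_secret_exposure_easy(
    final_report: dict, verified_findings: list[dict]
) -> float:
    """Deterministic grader sketch: verified, evidence-backed findings earn
    points; unverified findings are penalised; result is clamped to (0, 1)."""
    verified_types = {f["finding_type"] for f in verified_findings}
    score = 0.0
    for finding in final_report.get("findings", []):
        if finding.get("finding_type") in verified_types and finding.get("evidence_ids"):
            score += 0.60   # verified finding backed by evidence IDs
        else:
            score -= 0.40   # unverified ("hallucinated") finding
    return max(0.01, min(0.99, score))  # strict (0,1) validator clamp
```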

4) **Build + validate (local)**
```bash
openenv build
openenv validate --verbose
```
[SOURCED] `openenv build` and `openenv validate` are part of OpenEnv’s recommended validation workflow.

5) **Docker build/run smoke test**
```bash
docker build -t secops-evidence-gym:latest -f server/Dockerfile .
docker run -p 8000:8000 secops-evidence-gym:latest
curl http://localhost:8000/health
```
[SOURCED] This `docker build -f server/Dockerfile .` pattern is directly shown in OpenEnv deployment course material.

6) **Run inference locally**
```bash
export HF_TOKEN="..."
export API_BASE_URL="..."
export MODEL_NAME="..."
export ENV_URL="http://localhost:8000"
python inference.py
```
[SOURCED] These env var names and OpenAI SDK usage are consistent with hackathon guidance and existing inference implementations.

7) **Deploy to Hugging Face Spaces**
```bash
openenv push --repo-id <your-hf-username>/secops-evidence-gym
```
[SOURCED] `openenv push` is described as the fastest path to deploy to **Hugging Face Spaces**.

### Testing and validation plan (high-signal)

[SOURCED] OpenEnv stresses predictable API behaviour and type-safe contracts; hackathon validation is fail-fast.

[PROPOSAL] Test layers (in priority order):
- **API contract smoke tests:** reset/step/state return valid JSON; never crash on invalid tool name (should return an observation with an error field).
- **Grader tests:** for each task, verify (a) correctness cases score high, (b) hallucination cases score low, (c) score always ∈ (0,1).
- **Seed determinism tests:** same `seed` produces same evidence IDs and same verifier outputs.
- **Runtime test:** run `inference.py` end-to-end and assert wall-clock < 2 minutes locally; assume < 20 minutes on grader infra even with cold starts.
- **Reward sanity tests:** ensure reward increases monotonically with verified correctness; fails if verbosity alone increases reward.
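The seed-determinism layer reduces to a one-line property: seeded generation must be repeatable. A sketch with a stand-in generator (`generate_evidence_ids` is hypothetical; the real scenario generator would expose something equivalent):

```python
import random


def generate_evidence_ids(seed: int, count: int = 3) -> list[str]:
    """Stand-in for the scenario generator: a seeded RNG yields stable ids."""
    rng = random.Random(seed)  # per-episode RNG; never the global random module
    return [f"EVID-{rng.randint(100, 999)}" for _ in range(count)]


def test_seed_determinism() -> None:
    first = generate_evidence_ids(7)
    assert first == generate_evidence_ids(7)          # same seed, same ids
    assert all(e.startswith("EVID-") for e in first)  # id format holds
```

Using a per-episode `random.Random(seed)` instance (rather than the module-level RNG) keeps concurrent episodes from perturbing each other’s sequences.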

## Submission packaging, execution roadmap, real-world usefulness, and failure modes

**SECTION 14 — README / Demo / Submission Narrative**
[SOURCED] Judges likely assess both the environment’s technical correctness (programmatic checks) and qualitative merit (LLM scoring / narrative).

[PROPOSAL] README structure that “feels like a winner” 🏆:
- **Hero block:** one-paragraph pitch + why it’s real-world + safety claim.
- **Two-minute demo:** copy/paste commands + expected output snippet with `[START]/[STEP]/[END]`.
- **Environment contract:** action schema, observation schema, task list.
- **Grading:** explain deterministic verifiers + hallucination penalties.
- **Safety & isolation:** explicit exclusions (no egress, no shell, synthetic artefacts).
- **Real-world relevance:** how the benchmark/reporting maps to security workflows (triage, evidence, remediation).
- **Screenshots:** web UI (optional) + an evidence trace + one scored report example.

**SECTION 15 — Project Management Plan**
[PROPOSAL] Day-by-day (assuming a hackathon-style sprint):

- **Day 0 (scope lock + scaffold):** environment skeleton, `openenv.yaml` with 3 tasks, stub graders returning 0.5 (clamped), server runs locally.
- **Day 1 (determinism + validator):** implement scenario generator, evidence registry, verifiers, and strict (0,1) scoring; pass `openenv validate`.
- **Day 2 (baseline + polish):** implement `inference.py` strict logs; deploy to Spaces; polish README + demo artefacts.

[PROPOSAL] Critical path: `openenv.yaml` tasks+graders → grader clamp `(0,1)` → inference stdout format → Docker+Spaces deployment. (Everything else is secondary.)

**SECTION 16 — Real-World Usefulness Plan**
[SOURCED] NIST’s testing guide emphasises planning, conducting tests, analysing findings, and developing mitigation strategies; your environment’s “evidence → remediation” focus aligns with that lifecycle without requiring offensive exploitation.

[PROPOSAL] Who would care after the hackathon:
- security engineering teams evaluating agentic “triage + reporting” reliability,
- LLM tooling teams wanting benchmarks for **non-hallucinating, evidence-grounded** outputs,
- training teams building safe cyber ranges (without weaponisation).

[PROPOSAL] Post-hackathon upgrades (highest leverage):
- export trajectories as JSONL for offline training,
- add more scenario families (still safe) and a held-out split for generalisation,
- integrate with RL trainers (e.g., TRL’s OpenEnv integration) to show real training curves.
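The JSONL trajectory export in the upgrade list above is a few lines of stdlib. A sketch (function name and record shape are assumptions; any dict-per-step record works):

```python
import json
from typing import Iterable


def export_trajectory(steps: Iterable[dict], path: str) -> int:
    """Write one JSON object per line (JSONL); returns the record count.

    sort_keys keeps records byte-stable across runs, which makes
    trajectory diffs and dataset dedup straightforward.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for step in steps:
            fh.write(json.dumps(step, sort_keys=True) + "\n")
            count += 1
    return count
```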

[SOURCED] PenGym provides evidence that realism/faithfulness of environments can affect transfer and stability when moving from simulation to more realistic settings—so you should roadmap a “higher fidelity mode” (still safe) later, not in V1.

**SECTION 17 — Why the naive version would fail**
[PROPOSAL] Top failure patterns (and why they kill submissions):
- Too broad (full cyber range, live services): fails time/infra constraints.
- Fuzzy grading (LLM-only judging): non-deterministic, easy to game.
- Unbounded tools (shell/network): unsafe + untrainable action space.
- Scores at exactly 0.0 or 1.0: fail-fast “out of range” validator.
- Inference logs not parseable: phase-1 failure even if env is good.
- Port / health issues on Spaces: container “works locally” but fails remotely.

**SECTION 18 — Final Recommendation**

[PROPOSAL] **What should you build?**
Build **SecOps Evidence Gym**: a deterministic, safe, sandbox-only cyber analyst environment focused on evidence collection, verifier validation, and remediation reporting.

[PROPOSAL] **What should V1 include? (minimum winning set)**
- OpenEnv-compliant FastAPI env with typed models and `reset/step/state`.
- `openenv.yaml` with **3 tasks + graders**.
- Deterministic verifiers + strict score clamp to `(0,1)`.
- Baseline `inference.py` with strict `[START]/[STEP]/[END]` logging + OpenAI SDK usage for any LLM calls.
- HF Spaces deployment with a working `/health`.

[PROPOSAL] **What should you cut?**
- Any real pentesting/offensive content, any arbitrary command execution, any live targets, any correctness scoring via an LLM judge.

[PROPOSAL] **Top 5 implementation decisions that matter most**
1) Validator-safe `openenv.yaml` tasks+graders wiring.
2) Score/range compliance: clamp to `(0,1)` everywhere.
3) Strict stdout format in `inference.py`.
4) Deterministic verifiers as the source of truth.
5) Bounded tool set (≤8 tools) with anti-loop penalties.

[PROPOSAL] **Minimum viable winning submission**
A V1 with 3 tasks, deterministic graders, bounded tools, strict inference logging, and a polished README + demo trace.

[PROPOSAL] **Minimum viable real-world useful submission**
The same V1, plus: seed determinism, trajectory export, and a clear “how to add new scenarios” contributor guide.

[PROPOSAL] **If you only have time for 20% of ambition—do this exact 20%:**
- Implement **one** robust multi-step loop (tools → validate → report)
- Implement **exactly 3** tasks (easy/medium/hard)
- Make graders deterministic and validator-safe
- Make deployment + inference bulletproof
Everything else is stretch.

**Confidence (my estimate): 8.4/10** ✅🔥

## Sources and credibility ratings (with exact links)

[SOURCED] Ratings are my judgement of authority + relevance for this hackathon context (0–10). URLs are provided verbatim in code form.

### Tier 1 (official OpenEnv + hackathon dashboard)
- Credibility **9.5/10** — `https://github.com/meta-pytorch/OpenEnv`
- Credibility **9.0/10** — `https://github.com/meta-pytorch/OpenEnv/blob/main/envs/README.md`
- Credibility **8.5/10** — `https://github.com/meta-pytorch/OpenEnv/blob/main/.claude/skills/generate-openenv-env/SKILL.md`
- Credibility **9.0/10** — `https://www.scaler.com/school-of-technology/meta-pytorch-hackathon/dashboard`

### Tier 2 (strong community exemplars)
- Credibility **8.5/10** — `https://github.com/sid-rp/kube-sre-gym`
- Credibility **8.0/10** — `https://huggingface.co/openenv-community`
- Credibility **7.5/10** — `https://github.com/Harikishanth/Incident-Triage-Environment`

### Tier 3 (peer-reviewed / primary references for design constraints)
- Credibility **8.5/10** — PenGym (Computers & Security, open access): `https://www.sciencedirect.com/science/article/pii/S0167404824004450`
- Credibility **8.0/10** — Reward design + generalisation (Scientific Reports, 2025): `https://www.nature.com/articles/s41598-025-27702-6`
- Credibility **8.5/10** — AMaze (JOSS, 2025): `https://joss.theoj.org/papers/10.21105/joss.07208`
- Credibility **9.5/10** — NIST SP 800-115: `https://csrc.nist.gov/pubs/sp/800/115/final`
- Credibility **9.0/10** — NIST “Cyber Range: A Guide” (PDF landing): `https://www.nist.gov/document/cyber-range`
- Credibility **7.5/10** — “Cybersecurity of Cyber Ranges: Threats and Mitigations” (IJISR, 2022 PDF): `https://infonomics-society.org/wp-content/uploads/Cybersecurity-of-Cyber-Ranges.pdf`

### Tier 4 (useful validator “ground truth” signals from the field)
- Credibility **6.5/10** — Validator failure mode discussion (score must be strictly between 0 and 1): `https://www.reddit.com/r/pytorch/comments/1shi767/meta_x_pytorch_x_sst_x_openenv_hackathon_phase_2/`
- Credibility **7.0/10** — Strict logging format reference via a verified submission’s `inference.py`: `https://github.com/Harikishanth/Incident-Triage-Environment/blob/main/inference.py`

### Uploaded reference you provided
- Credibility **7.0/10** (useful as a design draft; not independently authoritative) — `deep-research-report (2).md`
inference.py ADDED
@@ -0,0 +1,215 @@
#!/usr/bin/env python3
"""Strict-log baseline inference for the Cyber Analyst OpenEnv environment."""

from __future__ import annotations

import os
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Any

from openai import OpenAI

PACKAGE_PARENT = Path(__file__).resolve().parent.parent
if str(PACKAGE_PARENT) not in sys.path:
    sys.path.insert(0, str(PACKAGE_PARENT))

from Cyber_analyst import CyberAnalystAction, CyberAnalystEnv


ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME = os.getenv("MODEL_NAME", "openai/gpt-oss-120b:novita")
API_KEY = (
    os.getenv("API_KEY") or os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY") or ""
)
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.0"))
SEED = int(os.getenv("SEED", "7"))


@dataclass(frozen=True)
class LLMConfig:
    base_url: str
    model_name: str
    api_key_present: bool
    temperature: float


def build_llm_config() -> LLMConfig:
    return LLMConfig(
        base_url=API_BASE_URL,
        model_name=MODEL_NAME,
        api_key_present=bool(API_KEY),
        temperature=TEMPERATURE,
    )


def build_openai_client() -> OpenAI | None:
    """Return an OpenAI-compatible client for HF router or OpenAI endpoints."""
    if not API_KEY:
        return None
    return OpenAI(base_url=API_BASE_URL, api_key=API_KEY)


def task_plan(task_id: str) -> list[CyberAnalystAction]:
    if task_id == "secret_exposure_easy":
        report = {
            "findings": [
                {
                    "finding_type": "secret_exposure",
                    "evidence_ids": ["EVID-101"],
                    "impact": "A synthetic API key-like secret is present in config.",
                    "remediation": "Remove the synthetic key from config and rotate the credential.",
                }
            ]
        }
        return [
            CyberAnalystAction(tool_name="search_repo", args={"query": "api key"}),
            CyberAnalystAction(
                tool_name="create_finding",
                args={
                    "finding_type": "secret_exposure",
                    "evidence_ids": ["EVID-101"],
                    "severity_guess": "high",
                    "remediation": "Remove the key and rotate the credential.",
                },
            ),
            CyberAnalystAction(tool_name="validate_finding", args={"finding_id": "FND-001"}),
            CyberAnalystAction(tool_name="submit_report", args={"report_json": report}),
        ]

    if task_id == "missing_security_headers_medium":
        report = {
            "findings": [
                {
                    "finding_type": "missing_security_headers",
                    "evidence_ids": ["EVID-201"],
                    "impact": "Gateway responses are missing HSTS and CSP headers.",
                    "remediation": "Add HSTS and CSP header policy at the gateway.",
                }
            ]
        }
        return [
            CyberAnalystAction(
                tool_name="check_security_headers", args={"service_id": "gateway"}
            ),
            CyberAnalystAction(
                tool_name="create_finding",
                args={
                    "finding_type": "missing_security_headers",
                    "evidence_ids": ["EVID-201"],
                    "severity_guess": "medium",
                    "remediation": "Add HSTS and CSP response headers at the gateway.",
                },
            ),
            CyberAnalystAction(tool_name="validate_finding", args={"finding_id": "FND-001"}),
            CyberAnalystAction(tool_name="submit_report", args={"report_json": report}),
        ]

    report = {
        "findings": [
            {
                "finding_type": "authz_boundary_misconfiguration",
                "evidence_ids": ["EVID-301", "EVID-302"],
                "impact": "The admin export route allows an analyst role outside the intended admin boundary.",
                "remediation": "Apply least privilege in the policy and add a regression test for the route.",
            }
        ]
    }
    return [
        CyberAnalystAction(tool_name="list_assets", args={}),
        CyberAnalystAction(
            tool_name="get_log_events",
            args={"service_id": "admin-service", "query": "admin export"},
        ),
        CyberAnalystAction(tool_name="search_repo", args={"query": "admin export"}),
        CyberAnalystAction(
            tool_name="create_finding",
            args={
                "finding_type": "authz_boundary_misconfiguration",
                "evidence_ids": ["EVID-301", "EVID-302"],
                "severity_guess": "critical",
                "remediation": "Apply least privilege in policy and add a regression test.",
            },
        ),
        CyberAnalystAction(tool_name="validate_finding", args={"finding_id": "FND-001"}),
        CyberAnalystAction(tool_name="submit_report", args={"report_json": report}),
    ]


def log_start(task_id: str, llm_config: LLMConfig) -> None:
    print(
        f"[START] task={task_id} env=Cyber_analyst model={llm_config.model_name}",
        flush=True,
    )


def log_step(
    step: int, action: CyberAnalystAction, reward: float | None, done: bool, error: str
) -> None:
    reward_value = 0.0 if reward is None else float(reward)
    error_value = error if error else "null"
    print(
        f"[STEP] step={step} action={action.tool_name} "
        f"reward={reward_value:.2f} done={str(done).lower()} error={error_value}",
        flush=True,
    )


def log_end(task_id: str, success: bool, steps: int, score: float, rewards: list[float]) -> None:
    rewards_text = ",".join(f"{reward:.2f}" for reward in rewards)
    print(
        f"[END] task={task_id} success={str(success).lower()} "
        f"steps={steps} score={score:.2f} rewards={rewards_text}",
        flush=True,
    )


def run_task(task_id: str, llm_config: LLMConfig) -> None:
    log_start(task_id, llm_config)
    rewards: list[float] = []
    final_score = 0.01
    success = False

    try:
        with CyberAnalystEnv(base_url=ENV_URL).sync() as env:
            reset_result = env.reset(task_id=task_id, seed=SEED)
            rewards.append(float(reset_result.reward or 0.0))

            for index, action in enumerate(task_plan(task_id), start=1):
                result = env.step(action)
                obs = result.observation
                reward = float(result.reward or 0.0)
                rewards.append(reward)
                log_step(index, action, result.reward, result.done, obs.error)
                if result.done:
                    final_score = float(obs.tool_result.get("score", reward))
                    success = final_score > 0.5
                    break
    except Exception as exc:
        log_step(0, CyberAnalystAction(tool_name="runtime_error", args={}), 0.01, True, str(exc))

    log_end(task_id, success, max(0, len(rewards) - 1), final_score, rewards)


def main() -> None:
    llm_config = build_llm_config()
    _ = build_openai_client()
    task_override = os.getenv("TASK_NAME")
    task_ids = (
        [task_override]
        if task_override
        else [
            "secret_exposure_easy",
            "missing_security_headers_medium",
            "authz_boundary_hard",
        ]
    )
    for task_id in task_ids:
        run_task(task_id, llm_config)


if __name__ == "__main__":
    main()
models.py ADDED
@@ -0,0 +1,64 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the BSD-style license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+ """Typed models for the Cyber Analyst OpenEnv environment."""
8
+
9
+ from typing import Any
10
+
11
+ from openenv.core.env_server.types import Action, Observation, State
12
+ from pydantic import Field
13
+
14
+
15
+ class CyberAnalystAction(Action):
16
+ """A bounded simulator tool call."""
17
+
18
+ tool_name: str = Field(..., description="Name of the approved simulator tool")
19
+ args: dict[str, Any] = Field(
20
+ default_factory=dict,
21
+ description="Tool arguments. The environment ignores unsupported keys.",
22
+ )
23
+
24
+
25
+ class CyberAnalystObservation(Observation):
26
+ """Observation returned after reset or an environment step."""
27
+
28
+ task_id: str = Field(default="", description="Current benchmark task id")
29
+ alert: str = Field(default="", description="Initial alert or task prompt")
30
+ phase: str = Field(default="investigate", description="Current episode phase")
31
+ tool_catalog: list[dict[str, Any]] = Field(
32
+ default_factory=list, description="Approved tools and their schemas"
33
+ )
34
+ tool_result: dict[str, Any] = Field(
35
+ default_factory=dict, description="Result returned by the latest tool call"
36
+ )
37
+ evidence_ids: list[str] = Field(
38
+ default_factory=list, description="Evidence ids discovered so far"
39
+ )
40
+ verified_findings: list[dict[str, Any]] = Field(
41
+ default_factory=list, description="Verifier-confirmed findings"
42
+ )
43
+ candidate_findings: list[dict[str, Any]] = Field(
44
+ default_factory=list, description="Candidate findings created by the agent"
45
+ )
46
+ step_budget_remaining: int = Field(
47
+ default=0, ge=0, description="Steps remaining before timeout"
48
+ )
49
+ score_breakdown: dict[str, Any] = Field(
50
+ default_factory=dict, description="Deterministic reward/score explanation"
51
+ )
52
+ error: str = Field(default="", description="Non-fatal environment error, if any")
53
+
54
+
55
+ class CyberAnalystState(State):
56
+ """State summary exposed via the OpenEnv state endpoint."""
57
+
58
+ task_id: str = Field(default="", description="Current benchmark task id")
59
+ seed: int | None = Field(default=None, description="Current deterministic seed")
60
+ phase: str = Field(default="investigate", description="Current episode phase")
61
+ step_budget_remaining: int = Field(default=0, ge=0)
62
+ recent_evidence_ids: list[str] = Field(default_factory=list)
63
+ verified_finding_ids: list[str] = Field(default_factory=list)
64
+ done: bool = Field(default=False)
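The wire format these models imply can be sketched with the standard library alone. This is an illustrative sketch, not part of the commit: the tool name `search_repo` and its `query` argument are taken from the tool catalog in `server/tasks.py`, and the deterministic `sort_keys` encoding mirrors what the environment uses for signatures and trajectory export.

```python
import json

# Hypothetical payload for one CyberAnalystAction: only the two fields
# declared above (tool_name, args) are needed.
action_payload = {
    "tool_name": "search_repo",
    "args": {"query": "api_key"},
}

# Canonical JSON encoding, as used elsewhere in this environment.
encoded = json.dumps(action_payload, sort_keys=True)

# The payload round-trips losslessly through JSON.
assert json.loads(encoded) == action_payload
print(encoded)
```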
openenv.yaml ADDED
@@ -0,0 +1,16 @@
+ spec_version: 1
+ name: Cyber_analyst
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 8000
+ tasks:
+   - id: secret_exposure_easy
+     description: Detect a leaked synthetic API key in a repo snapshot and submit rotation/removal remediation.
+     grader: server.graders:grade_secret_exposure_easy
+   - id: missing_security_headers_medium
+     description: Detect missing HSTS/CSP headers in a synthetic gateway header snapshot and submit remediation.
+     grader: server.graders:grade_missing_security_headers_medium
+   - id: authz_boundary_hard
+     description: Detect an admin route role-policy mismatch and submit least-privilege remediation.
+     grader: server.graders:grade_authz_boundary_hard
openenv_Cyber_analyst.egg-info/PKG-INFO ADDED
@@ -0,0 +1,10 @@
+ Metadata-Version: 2.4
+ Name: openenv-Cyber_analyst
+ Version: 0.1.0
+ Summary: Cyber Analyst environment for OpenEnv
+ Requires-Python: >=3.10
+ Requires-Dist: openenv-core[core]>=0.2.2
+ Requires-Dist: openai>=1.0.0
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
openenv_Cyber_analyst.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,22 @@
+ README.md
+ __init__.py
+ client.py
+ inference.py
+ models.py
+ pyproject.toml
+ ./__init__.py
+ ./client.py
+ ./inference.py
+ ./models.py
+ openenv_Cyber_analyst.egg-info/PKG-INFO
+ openenv_Cyber_analyst.egg-info/SOURCES.txt
+ openenv_Cyber_analyst.egg-info/dependency_links.txt
+ openenv_Cyber_analyst.egg-info/entry_points.txt
+ openenv_Cyber_analyst.egg-info/requires.txt
+ openenv_Cyber_analyst.egg-info/top_level.txt
+ server/Cyber_analyst_environment.py
+ server/__init__.py
+ server/app.py
+ server/graders.py
+ server/tasks.py
+ tests/test_environment.py
openenv_Cyber_analyst.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
openenv_Cyber_analyst.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
+ [console_scripts]
+ server = Cyber_analyst.server.app:main
openenv_Cyber_analyst.egg-info/requires.txt ADDED
@@ -0,0 +1,6 @@
+ openenv-core[core]>=0.2.2
+ openai>=1.0.0
+
+ [dev]
+ pytest>=8.0.0
+ pytest-cov>=4.0.0
openenv_Cyber_analyst.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ Cyber_analyst
pyproject.toml ADDED
@@ -0,0 +1,39 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ [build-system]
+ requires = ["setuptools>=45", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "openenv-Cyber_analyst"
+ version = "0.1.0"
+ description = "Cyber Analyst environment for OpenEnv"
+ requires-python = ">=3.10"
+ dependencies = [
+     # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+     # install from github
+     # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+     "openenv-core[core]>=0.2.2",
+     # Environment-specific dependencies
+     "openai>=1.0.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-cov>=4.0.0",
+ ]
+
+ [project.scripts]
+ # Server entry point - enables running via: uv run --project . server
+ # or: python -m Cyber_analyst.server.app
+ server = "Cyber_analyst.server.app:main"
+
+ [tool.setuptools]
+ include-package-data = true
+ packages = ["Cyber_analyst", "Cyber_analyst.server"]
+ package-dir = { "Cyber_analyst" = ".", "Cyber_analyst.server" = "server" }
server/Cyber_analyst_environment.py ADDED
@@ -0,0 +1,506 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """SecOps Evidence Gym environment implementation."""
+
+ from __future__ import annotations
+
+ import hashlib
+ import json
+ from collections import Counter
+ from typing import Any
+ from uuid import uuid4
+
+ from openenv.core.env_server.interfaces import Environment
+
+ try:
+     from ..models import (
+         CyberAnalystAction,
+         CyberAnalystObservation,
+         CyberAnalystState,
+     )
+     from .graders import safe_reward, score_report
+     from .tasks import DEFAULT_TASK_ID, TOOL_CATALOG, build_scenario
+ except ImportError:  # pragma: no cover - supports direct module execution
+     from models import CyberAnalystAction, CyberAnalystObservation, CyberAnalystState
+     from server.graders import safe_reward, score_report
+     from server.tasks import DEFAULT_TASK_ID, TOOL_CATALOG, build_scenario
+
+
+ class CyberAnalystEnvironment(
+     Environment[CyberAnalystAction, CyberAnalystObservation, CyberAnalystState]
+ ):
+     """A safe, deterministic evidence-grounded cyber analyst benchmark."""
+
+     SUPPORTS_CONCURRENT_SESSIONS: bool = True
+     MAX_STEPS = 12
+     REPEAT_HARD_STOP = 6
+
+     def __init__(self):
+         super().__init__()
+         self._scenario: dict[str, Any] = {}
+         self._state = CyberAnalystState()
+         self._discovered_evidence: set[str] = set()
+         self._candidate_findings: dict[str, dict[str, Any]] = {}
+         self._verified_findings: list[dict[str, Any]] = []
+         self._validated_finding_ids: set[str] = set()
+         self._action_counts: Counter[str] = Counter()
+         self._last_score_breakdown: dict[str, Any] = {}
+         self._trajectory_events: list[dict[str, Any]] = []
+         self._initialize_episode(DEFAULT_TASK_ID, seed=None, episode_id=None)
+
+     def reset(
+         self,
+         seed: int | None = None,
+         episode_id: str | None = None,
+         task_id: str = DEFAULT_TASK_ID,
+         **_: Any,
+     ) -> CyberAnalystObservation:
+         """Reset the selected deterministic task."""
+
+         self._initialize_episode(task_id=task_id, seed=seed, episode_id=episode_id)
+         tool_result = {
+             "message": "Cyber Analyst environment ready.",
+             "allowed_scope": "Synthetic artifacts only. No live targets or shell.",
+         }
+         obs = self._observation(
+             tool_result={
+                 **tool_result,
+                 "trajectory_jsonl": self.export_trajectory_jsonl(),
+             },
+             reward=0.01,
+         )
+         self._record_trajectory("reset", None, tool_result, obs.reward, obs.done, obs.error)
+         return obs
+
+     def step(  # type: ignore[override]
+         self,
+         action: CyberAnalystAction,
+         timeout_s: float | None = None,
+         **_: Any,
+     ) -> CyberAnalystObservation:
+         """Execute one bounded simulator tool call."""
+
+         del timeout_s
+
+         if self._state.done:
+             tool_result = {"message": "Episode is already complete."}
+             obs = self._observation(
+                 tool_result=tool_result,
+                 reward=0.01,
+                 done=True,
+                 error="episode_already_done",
+             )
+             self._record_trajectory("step", action, tool_result, obs.reward, obs.done, obs.error)
+             return obs
+
+         self._state.step_count += 1
+         self._state.step_budget_remaining = max(
+             0, self.MAX_STEPS - self._state.step_count
+         )
+
+         signature = self._action_signature(action)
+         self._action_counts[signature] += 1
+         repeat_count = self._action_counts[signature]
+
+         if repeat_count >= self.REPEAT_HARD_STOP:
+             self._state.phase = "done"
+             self._state.done = True
+             self._last_score_breakdown = {
+                 "score": 0.03,
+                 "repeat_hard_stop": True,
+                 "signature": signature,
+             }
+             tool_result = {"message": "Episode stopped after repeated identical actions."}
+             obs = self._observation(
+                 tool_result=tool_result,
+                 reward=0.03,
+                 done=True,
+                 error="repeat_hard_stop",
+             )
+             self._record_trajectory("step", action, tool_result, obs.reward, obs.done, obs.error)
+             return obs
+
+         handler = getattr(self, f"_tool_{action.tool_name}", None)
+         if handler is None:
+             tool_result = {
+                 "ok": False,
+                 "message": f"Unsupported tool: {action.tool_name}",
+                 "available_tools": [tool["name"] for tool in TOOL_CATALOG],
+             }
+             obs = self._step_observation(
+                 tool_result=tool_result,
+                 repeat_count=repeat_count,
+                 error="unsupported_tool",
+             )
+             self._record_trajectory("step", action, tool_result, obs.reward, obs.done, obs.error)
+             return obs
+
+         try:
+             result, reward_delta, done = handler(action.args)
+             error = ""
+         except Exception as exc:  # pragma: no cover - defensive rollout guard
+             result = {"ok": False, "message": str(exc)}
+             reward_delta = -0.05
+             done = False
+             error = exc.__class__.__name__
+
+         if self._state.step_budget_remaining <= 0 and not done:
+             done = True
+             self._state.phase = "done"
+             self._state.done = True
+             result = {
+                 **result,
+                 "timeout": True,
+                 "message": "Step budget exhausted before report submission.",
+             }
+             reward_delta -= 0.10
+
+         obs = self._step_observation(
+             tool_result=result,
+             repeat_count=repeat_count,
+             reward_delta=reward_delta,
+             done=done,
+             error=error,
+         )
+         self._record_trajectory("step", action, result, obs.reward, obs.done, obs.error)
+         return obs
+
+     @property
+     def state(self) -> CyberAnalystState:
+         """Return the current episode state summary."""
+
+         return self._state
+
+     def _initialize_episode(
+         self, task_id: str, seed: int | None, episode_id: str | None
+     ) -> None:
+         self._scenario = build_scenario(task_id, seed)
+         self._discovered_evidence = set()
+         self._candidate_findings = {}
+         self._verified_findings = []
+         self._validated_finding_ids = set()
+         self._action_counts = Counter()
+         self._last_score_breakdown = {}
+         self._trajectory_events = []
+         self._state = CyberAnalystState(
+             episode_id=episode_id or str(uuid4()),
+             step_count=0,
+             task_id=self._scenario["task_id"],
+             seed=seed,
+             phase="investigate",
+             step_budget_remaining=self.MAX_STEPS,
+             recent_evidence_ids=[],
+             verified_finding_ids=[],
+             done=False,
+         )
+
+     def export_trajectory_jsonl(self) -> str:
+         """Return the current episode trajectory as JSONL for offline analysis."""
+
+         return "\n".join(
+             json.dumps(event, sort_keys=True, default=str)
+             for event in self._trajectory_events
+         )
+
+     def _record_trajectory(
+         self,
+         event_type: str,
+         action: CyberAnalystAction | None,
+         tool_result: dict[str, Any],
+         reward: float | int | None,
+         done: bool,
+         error: str,
+     ) -> None:
+         action_payload = None
+         if action is not None:
+             action_payload = action.model_dump(exclude_none=True)
+         self._trajectory_events.append(
+             {
+                 "episode_id": self._state.episode_id,
+                 "task_id": self._state.task_id,
+                 "seed": self._state.seed,
+                 "event_type": event_type,
+                 "step": self._state.step_count,
+                 "phase": self._state.phase,
+                 "action": action_payload,
+                 "tool_result": tool_result,
+                 "evidence_ids": sorted(self._discovered_evidence),
+                 "verified_finding_ids": list(self._state.verified_finding_ids),
+                 "reward": reward,
+                 "done": done,
+                 "error": error,
+             }
+         )
+
+     def _observation(
+         self,
+         tool_result: dict[str, Any] | None = None,
+         reward: float = 0.01,
+         done: bool | None = None,
+         error: str = "",
+     ) -> CyberAnalystObservation:
+         done_value = self._state.done if done is None else done
+         return CyberAnalystObservation(
+             task_id=self._scenario.get("task_id", ""),
+             alert=self._scenario.get("alert", ""),
+             phase=self._state.phase,
+             tool_catalog=TOOL_CATALOG,
+             tool_result=tool_result or {},
+             evidence_ids=sorted(self._discovered_evidence),
+             verified_findings=list(self._verified_findings),
+             candidate_findings=list(self._candidate_findings.values()),
+             step_budget_remaining=self._state.step_budget_remaining,
+             score_breakdown=dict(self._last_score_breakdown),
+             error=error,
+             done=done_value,
+             reward=safe_reward(reward),
+         )
+
+     def _step_observation(
+         self,
+         tool_result: dict[str, Any],
+         repeat_count: int,
+         reward_delta: float = 0.0,
+         done: bool = False,
+         error: str = "",
+     ) -> CyberAnalystObservation:
+         reward = 0.04 + reward_delta - 0.01
+         if repeat_count > 2:
+             reward -= 0.03 * (repeat_count - 2)
+
+         if done:
+             self._state.phase = "done"
+             self._state.done = True
+
+         self._state.recent_evidence_ids = sorted(self._discovered_evidence)[-5:]
+         self._state.verified_finding_ids = [
+             finding["finding_id"] for finding in self._verified_findings
+         ]
+
+         return self._observation(
+             tool_result=tool_result,
+             reward=safe_reward(reward),
+             done=self._state.done,
+             error=error,
+         )
+
+     def _action_signature(self, action: CyberAnalystAction) -> str:
+         payload = {
+             "tool_name": action.tool_name,
+             "args": action.args,
+         }
+         encoded = json.dumps(payload, sort_keys=True, default=str)
+         return hashlib.sha256(encoded.encode("utf-8")).hexdigest()[:16]
+
+     def _record_evidence(self, evidence_ids: list[str]) -> int:
+         relevant = set(self._scenario.get("required_evidence", [])) | set(
+             self._scenario.get("supporting_evidence", [])
+         )
+         new_relevant = 0
+         for evidence_id in evidence_ids:
+             if evidence_id not in self._discovered_evidence and evidence_id in relevant:
+                 new_relevant += 1
+             self._discovered_evidence.add(evidence_id)
+         return new_relevant
+
+     def _filter_entries(
+         self, entries: list[dict[str, Any]], service_id: str = "", query: str = ""
+     ) -> list[dict[str, Any]]:
+         normalized_service = self._resolve_service_id(service_id).lower()
+         normalized_query = query.strip().lower()
+         matches: list[dict[str, Any]] = []
+         for entry in entries:
+             service_matches = (
+                 not normalized_service
+                 or str(entry.get("service_id", "")).lower() == normalized_service
+             )
+             search_blob = " ".join(
+                 [
+                     str(entry.get("text", "")),
+                     str(entry.get("source", "")),
+                     " ".join(str(tag) for tag in entry.get("tags", [])),
+                 ]
+             ).lower()
+             query_matches = not normalized_query or normalized_query in search_blob
+             if service_matches and query_matches:
+                 matches.append(entry)
+         return matches
+
+     def _resolve_service_id(self, service_id: str) -> str:
+         normalized = service_id.strip()
+         aliases = self._scenario.get("service_aliases", {})
+         return str(aliases.get(normalized, normalized))
+
+     def _evidence_payload(self, entries: list[dict[str, Any]]) -> dict[str, Any]:
+         evidence_ids = [entry["evidence_id"] for entry in entries]
+         new_relevant = self._record_evidence(evidence_ids)
+         return {
+             "ok": True,
+             "evidence_ids": evidence_ids,
+             "new_relevant_evidence": new_relevant,
+             "entries": [
+                 {
+                     "evidence_id": entry["evidence_id"],
+                     "service_id": entry.get("service_id", ""),
+                     "source": entry.get("source", ""),
+                     "text": entry.get("text", ""),
+                 }
+                 for entry in entries
+             ],
+         }
+
+     def _tool_list_assets(self, args: dict[str, Any]) -> tuple[dict[str, Any], float, bool]:
+         del args
+         return {"ok": True, "assets": self._scenario["assets"]}, 0.0, False
+
+     def _tool_get_log_events(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         entries = self._filter_entries(
+             self._scenario.get("logs", []),
+             service_id=str(args.get("service_id", "")),
+             query=str(args.get("query", "")),
+         )
+         payload = self._evidence_payload(entries)
+         return payload, 0.02 * payload["new_relevant_evidence"], False
+
+     def _tool_check_security_headers(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         requested_service = self._resolve_service_id(str(args.get("service_id", ""))).lower()
+         snapshots = self._scenario.get("headers", {})
+         results = []
+         evidence_ids = []
+         for service_id, snapshot in snapshots.items():
+             if requested_service and service_id.lower() != requested_service:
+                 continue
+             evidence_ids.append(snapshot["evidence_id"])
+             results.append(
+                 {
+                     "service_id": service_id,
+                     "evidence_id": snapshot["evidence_id"],
+                     "present": snapshot.get("present", []),
+                     "missing": snapshot.get("missing", []),
+                     "passed": not snapshot.get("missing"),
+                 }
+             )
+         new_relevant = self._record_evidence(evidence_ids)
+         return (
+             {
+                 "ok": True,
+                 "evidence_ids": evidence_ids,
+                 "new_relevant_evidence": new_relevant,
+                 "header_results": results,
+             },
+             0.02 * new_relevant,
+             False,
+         )
+
+     def _tool_search_repo(self, args: dict[str, Any]) -> tuple[dict[str, Any], float, bool]:
+         entries = self._filter_entries(
+             self._scenario.get("repo", []), query=str(args.get("query", ""))
+         )
+         payload = self._evidence_payload(entries)
+         return payload, 0.02 * payload["new_relevant_evidence"], False
+
+     def _tool_scan_dependencies(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         del args
+         payload = self._evidence_payload(self._scenario.get("dependencies", []))
+         return payload, 0.02 * payload["new_relevant_evidence"], False
+
+     def _tool_create_finding(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         evidence_ids = args.get("evidence_ids", [])
+         if isinstance(evidence_ids, str):
+             evidence_ids = [evidence_ids]
+         evidence_ids = [str(evidence_id) for evidence_id in evidence_ids]
+
+         finding_id = f"FND-{len(self._candidate_findings) + 1:03d}"
+         finding = {
+             "finding_id": finding_id,
+             "finding_type": str(args.get("finding_type", "")),
+             "evidence_ids": evidence_ids,
+             "severity_guess": str(args.get("severity_guess", "")),
+             "remediation": str(args.get("remediation", "")),
+             "validated": False,
+             "matching_gt_id": None,
+         }
+         self._candidate_findings[finding_id] = finding
+
+         well_formed = bool(
+             finding["finding_type"] and evidence_ids and finding["remediation"]
+         )
+         return (
+             {"ok": True, "finding_id": finding_id, "finding": finding},
+             0.03 if well_formed else 0.0,
+             False,
+         )
+
+     def _tool_validate_finding(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         finding_id = str(args.get("finding_id", ""))
+         finding = self._candidate_findings.get(finding_id)
+         if finding is None:
+             return (
+                 {"ok": False, "message": f"Unknown finding_id: {finding_id}"},
+                 -0.03,
+                 False,
+             )
+
+         expected_type = self._scenario["finding_type"]
+         required_evidence = set(self._scenario.get("required_evidence", []))
+         supplied_evidence = set(finding.get("evidence_ids", []))
+         verified = (
+             finding.get("finding_type") == expected_type
+             and bool(required_evidence & supplied_evidence)
+         )
+         self._validated_finding_ids.add(finding_id)
+         finding["validated"] = verified
+         finding["matching_gt_id"] = self._scenario["ground_truth_id"] if verified else None
+
+         if verified and not any(
+             item["finding_id"] == finding_id for item in self._verified_findings
+         ):
+             self._verified_findings.append(dict(finding))
+
+         return (
+             {
+                 "ok": True,
+                 "finding_id": finding_id,
+                 "verified": verified,
+                 "matching_gt_id": finding["matching_gt_id"],
+             },
+             0.08 if verified else -0.02,
+             False,
+         )
+
+     def _tool_submit_report(
+         self, args: dict[str, Any]
+     ) -> tuple[dict[str, Any], float, bool]:
+         report = args.get("report_json", {})
+         score, breakdown = score_report(
+             self._scenario["task_id"],
+             report,
+             verified_findings=self._verified_findings,
+             validation_attempted=bool(self._validated_finding_ids),
+         )
+         self._last_score_breakdown = breakdown
+         return (
+             {
+                 "ok": True,
+                 "submitted": True,
+                 "score": score,
+                 "score_breakdown": breakdown,
+                 "trajectory_jsonl": self.export_trajectory_jsonl(),
+             },
+             score,
+             True,
+         )
server/__init__.py ADDED
@@ -0,0 +1,11 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Cyber Analyst environment server components."""
+
+ from .Cyber_analyst_environment import CyberAnalystEnvironment
+
+ __all__ = ["CyberAnalystEnvironment"]
server/app.py ADDED
@@ -0,0 +1,79 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ FastAPI application for the Cyber Analyst Environment.
+
+ This module creates an HTTP server that exposes the CyberAnalystEnvironment
+ over HTTP and WebSocket endpoints, compatible with EnvClient.
+
+ Endpoints:
+ - POST /reset: Reset the environment
+ - POST /step: Execute an action
+ - GET /state: Get current environment state
+ - GET /schema: Get action/observation schemas
+ - WS /ws: WebSocket endpoint for persistent sessions
+
+ Usage:
+     # Development (with auto-reload):
+     uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+
+     # Production:
+     uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
+
+     # Or run directly:
+     python -m server.app
+ """
+
+ try:
+     from openenv.core.env_server.http_server import create_app
+ except Exception as e:  # pragma: no cover
+     raise ImportError(
+         "openenv is required for the web interface. Install dependencies with 'uv sync'."
+     ) from e
+
+ try:
+     from ..models import CyberAnalystAction, CyberAnalystObservation
+     from .Cyber_analyst_environment import CyberAnalystEnvironment
+ except ImportError:
+     from models import CyberAnalystAction, CyberAnalystObservation
+     from server.Cyber_analyst_environment import CyberAnalystEnvironment
+
+
+ # Create the app with web interface and README integration
+ app = create_app(
+     CyberAnalystEnvironment,
+     CyberAnalystAction,
+     CyberAnalystObservation,
+     env_name="Cyber_analyst",
+     max_concurrent_envs=1,  # increase this number to allow more concurrent WebSocket sessions
+ )
+
+
+ def main(host: str = "0.0.0.0", port: int = 8000):
+     """
+     Entry point for direct execution via uv run or python -m.
+
+     This function enables running the server without Docker:
+         uv run --project . server
+         uv run --project . server --port 8001
+         python -m Cyber_analyst.server.app
+
+     Args:
+         host: Host address to bind to (default: "0.0.0.0")
+         port: Port number to listen on (default: 8000)
+
+     For production deployments, consider using uvicorn directly with
+     multiple workers:
+         uvicorn Cyber_analyst.server.app:app --workers 4
+     """
+     import uvicorn
+
+     uvicorn.run(app, host=host, port=port)
+
+
+ if __name__ == "__main__":
+     main()
server/graders.py ADDED
@@ -0,0 +1,169 @@
+ """Deterministic graders for the Cyber Analyst OpenEnv tasks."""
+
+ from __future__ import annotations
+
+ import json
+ from typing import Any
+
+ try:
+     from .tasks import SCENARIOS
+ except ImportError:  # pragma: no cover - supports direct module execution
+     from tasks import SCENARIOS
+
+
+ MIN_SCORE = 0.01
+ MAX_SCORE = 0.99
+
+
+ def safe_reward(score: float | int | None) -> float:
+     """Clamp validator-facing scores to the strict open interval (0, 1)."""
+
+     try:
+         value = float(score if score is not None else 0.0)
+     except (TypeError, ValueError):
+         value = 0.0
+     return max(MIN_SCORE, min(MAX_SCORE, value))
+
+
+ def _coerce_report(report: Any) -> dict[str, Any]:
+     if isinstance(report, dict):
+         return report
+     if isinstance(report, str):
+         try:
+             decoded = json.loads(report)
+         except json.JSONDecodeError:
+             return {"summary": report, "findings": []}
+         return decoded if isinstance(decoded, dict) else {"findings": []}
+     return {"findings": []}
+
+
+ def _text_contains_any(text: str, keywords: list[str]) -> bool:
+     lowered = text.lower()
+     return any(keyword.lower() in lowered for keyword in keywords)
+
+
+ def _report_findings(report: dict[str, Any]) -> list[dict[str, Any]]:
+     findings = report.get("findings", [])
+     if isinstance(findings, dict):
+         findings = [findings]
+     return [finding for finding in findings if isinstance(finding, dict)]
+
+
+ def score_report(
+     task_id: str,
+     report: Any,
+     verified_findings: list[dict[str, Any]] | None = None,
+     validation_attempted: bool = False,
+ ) -> tuple[float, dict[str, Any]]:
+     """Score a submitted report against one task's deterministic ground truth."""
+
+     scenario = SCENARIOS.get(task_id)
+     report_dict = _coerce_report(report)
+     report_findings = _report_findings(report_dict)
+     verified_findings = verified_findings or []
+
+     if scenario is None:
+         return MIN_SCORE, {"unknown_task": task_id}
+
+     expected_type = scenario["finding_type"]
+     expected_evidence = set(scenario.get("required_evidence", [])) | set(
+         scenario.get("supporting_evidence", [])
+     )
+
+     matching_verified = [
+         finding
+         for finding in verified_findings
+         if finding.get("finding_type") == expected_type
+     ]
+     matching_report = [
+         finding for finding in report_findings if finding.get("finding_type") == expected_type
+     ]
+
+     score = 0.05
+     breakdown: dict[str, Any] = {
+         "base": 0.05,
+         "verified_correct": 0.0,
+         "valid_evidence": 0.0,
+         "actionable_remediation": 0.0,
+         "hallucination_penalty": 0.0,
+         "validation_penalty": 0.0,
+     }
+
+     if matching_verified and matching_report:
+         impact_text = " ".join(
+             str(finding.get("impact", "")) + " " + str(finding.get("description", ""))
+             for finding in matching_report
+         )
+         if _text_contains_any(impact_text, scenario.get("impact_keywords", [])):
+             score += 0.60
+             breakdown["verified_correct"] = 0.60
+
+     report_evidence: set[str] = set()
+     for finding in matching_report:
+         evidence_ids = finding.get("evidence_ids", [])
+         if isinstance(evidence_ids, str):
+             evidence_ids = [evidence_ids]
+         report_evidence.update(str(evidence_id) for evidence_id in evidence_ids)
+
+     if report_evidence & expected_evidence:
+         score += 0.15
+         breakdown["valid_evidence"] = 0.15
+
+     remediation_text = " ".join(
+         str(finding.get("remediation", "")) for finding in matching_report
+     )
+     if _text_contains_any(remediation_text, scenario.get("remediation_keywords", [])):
+         score += 0.15
+         breakdown["actionable_remediation"] = 0.15
+
+     verified_types = {finding.get("finding_type") for finding in verified_findings}
+     hallucinated = [
+         finding
+         for finding in report_findings
+         if finding.get("finding_type") not in verified_types
+     ]
+     if hallucinated:
+         penalty = 0.40 * len(hallucinated)
+         score -= penalty
+         breakdown["hallucination_penalty"] = -penalty
+
+     if not validation_attempted:
+         score -= 0.20
+         breakdown["validation_penalty"] = -0.20
+
+     final_score = safe_reward(score)
+     breakdown["raw_score"] = round(score, 4)
+     breakdown["score"] = final_score
+     return final_score, breakdown
+
+
+ def _payload_from_args(*args: Any, **kwargs: Any) -> dict[str, Any]:
+     if args and isinstance(args[0], dict):
+         payload = dict(args[0])
+     else:
+         payload = {}
+     payload.update(kwargs)
+     return payload
+
+
+ def grade_task(task_id: str, *args: Any, **kwargs: Any) -> float:
+     """Manifest-friendly grader adapter."""
+
+     payload = _payload_from_args(*args, **kwargs)
+     report = payload.get("report") or payload.get("report_json") or payload
+     verified_findings = payload.get("verified_findings", [])
+     validation_attempted = bool(payload.get("validation_attempted", False))
+     score, _ = score_report(task_id, report, verified_findings, validation_attempted)
+     return score
+
+
+ def grade_secret_exposure_easy(*args: Any, **kwargs: Any) -> float:
+     return grade_task("secret_exposure_easy", *args, **kwargs)
+
+
+ def grade_missing_security_headers_medium(*args: Any, **kwargs: Any) -> float:
+     return grade_task("missing_security_headers_medium", *args, **kwargs)
+
+
+ def grade_authz_boundary_hard(*args: Any, **kwargs: Any) -> float:
+     return grade_task("authz_boundary_hard", *args, **kwargs)
server/requirements.txt ADDED
@@ -0,0 +1,6 @@
+ openenv[core]>=0.2.0
+ fastapi>=0.115.0
+ uvicorn>=0.24.0
+ openai>=1.0.0
+
+
server/tasks.py ADDED
@@ -0,0 +1,283 @@
+ """Deterministic scenario registry for SecOps Evidence Gym."""
+
+ from __future__ import annotations
+
+ from copy import deepcopy
+ from random import Random
+ from typing import Any
+
+
+ DEFAULT_TASK_ID = "secret_exposure_easy"
+
+ TOOL_CATALOG: list[dict[str, Any]] = [
+     {
+         "name": "list_assets",
+         "description": "List synthetic services, routes, and artifact collections.",
+         "args": {},
+     },
+     {
+         "name": "get_log_events",
+         "description": "Return sanitized telemetry evidence ids for a service/query.",
+         "args": {"service_id": "str", "query": "str"},
+     },
+     {
+         "name": "check_security_headers",
+         "description": "Inspect a service header snapshot and return pass/fail evidence.",
+         "args": {"service_id": "str"},
+     },
+     {
+         "name": "search_repo",
+         "description": "Search synthetic repo/config snippets for evidence ids.",
+         "args": {"query": "str"},
+     },
+     {
+         "name": "scan_dependencies",
+         "description": "Inspect a synthetic dependency manifest excerpt.",
+         "args": {},
+     },
+     {
+         "name": "create_finding",
+         "description": "Store a candidate finding for verifier review.",
+         "args": {
+             "finding_type": "str",
+             "evidence_ids": "list[str]",
+             "severity_guess": "str",
+             "remediation": "str",
+         },
+     },
+     {
+         "name": "validate_finding",
+         "description": "Run the deterministic verifier for a candidate finding.",
+         "args": {"finding_id": "str"},
+     },
+     {
+         "name": "submit_report",
+         "description": "Submit the final structured report and end the episode.",
+         "args": {"report_json": "dict"},
+     },
+ ]
+
+
+ SCENARIOS: dict[str, dict[str, Any]] = {
+     "secret_exposure_easy": {
+         "task_id": "secret_exposure_easy",
+         "difficulty": "easy",
+         "title": "Secret exposure in repo snapshot",
+         "alert": (
+             "Repository hygiene alert: a synthetic config snapshot may contain "
+             "credential-like material. Investigate, cite evidence, validate, and "
+             "submit a remediation report."
+         ),
+         "ground_truth_id": "GT-SECRET-001",
+         "finding_type": "secret_exposure",
+         "severity": "high",
+         "required_evidence": ["EVID-101"],
+         "remediation_keywords": ["rotate", "remove"],
+         "impact_keywords": ["secret", "credential", "api key", "leak"],
+         "assets": {
+             "services": ["gateway", "profile-service", "admin-service"],
+             "routes": ["/login", "/profile", "/admin/export"],
+             "collections": ["repo_snapshot", "telemetry", "headers", "dependencies"],
+         },
+         "repo": [
+             {
+                 "evidence_id": "EVID-101",
+                 "source": "repo_snapshot/config/profile-service.env",
+                 "service_id": "profile-service",
+                 "tags": ["api key", "secret", "credential", "config"],
+                 "text": (
+                     "PROFILE_EXPORT_API_KEY=CYBERLAB_DO_NOT_USE_API_KEY_7F3A "
+                     "# synthetic lab credential"
+                 ),
+             },
+             {
+                 "evidence_id": "EVID-102",
+                 "source": "repo_snapshot/config/gateway.env",
+                 "service_id": "gateway",
+                 "tags": ["config", "benign"],
+                 "text": "LOG_LEVEL=info; FEATURE_SAFE_HEADERS=true",
+             },
+         ],
+         "logs": [
+             {
+                 "evidence_id": "EVID-103",
+                 "service_id": "profile-service",
+                 "tags": ["benign", "startup"],
+                 "text": "profile-service boot completed for synthetic tenant secops-lab",
+             }
+         ],
+         "headers": {
+             "gateway": {
+                 "evidence_id": "EVID-104",
+                 "present": ["Strict-Transport-Security", "Content-Security-Policy"],
+                 "missing": [],
+             }
+         },
+         "dependencies": [
+             {
+                 "evidence_id": "EVID-105",
+                 "source": "repo_snapshot/requirements.lock",
+                 "tags": ["dependency", "benign"],
+                 "text": "fastapi==0.115.0; pydantic==2.8.2",
+             }
+         ],
+     },
+     "missing_security_headers_medium": {
+         "task_id": "missing_security_headers_medium",
+         "difficulty": "medium",
+         "title": "Missing security headers",
+         "alert": (
+             "Gateway response-hardening alert: verify whether required security "
+             "headers are missing or weak and submit evidence-backed remediation."
+         ),
+         "ground_truth_id": "GT-HEADERS-001",
+         "finding_type": "missing_security_headers",
+         "severity": "medium",
+         "required_evidence": ["EVID-201"],
+         "remediation_keywords": ["hsts", "csp"],
+         "impact_keywords": ["header", "hsts", "csp", "clickjacking"],
+         "assets": {
+             "services": ["gateway", "profile-service", "admin-service"],
+             "routes": ["/login", "/profile", "/admin/export"],
+             "collections": ["repo_snapshot", "telemetry", "headers", "dependencies"],
+         },
+         "repo": [
+             {
+                 "evidence_id": "EVID-202",
+                 "source": "repo_snapshot/gateway/security_headers.py",
+                 "service_id": "gateway",
+                 "tags": ["headers", "config"],
+                 "text": "X-Frame-Options is set, but HSTS and CSP are not configured.",
+             }
+         ],
+         "logs": [
+             {
+                 "evidence_id": "EVID-203",
+                 "service_id": "gateway",
+                 "tags": ["benign", "response"],
+                 "text": "GET /profile 200 request_id=req-442 synthetic header audit",
+             }
+         ],
+         "headers": {
+             "gateway": {
+                 "evidence_id": "EVID-201",
+                 "present": ["X-Frame-Options", "X-Content-Type-Options"],
+                 "missing": ["Strict-Transport-Security", "Content-Security-Policy"],
+             }
+         },
+         "dependencies": [
+             {
+                 "evidence_id": "EVID-204",
+                 "source": "repo_snapshot/requirements.lock",
+                 "tags": ["dependency", "benign"],
+                 "text": "starlette==0.38.2; uvicorn==0.30.1",
+             }
+         ],
+     },
+     "authz_boundary_hard": {
+         "task_id": "authz_boundary_hard",
+         "difficulty": "hard",
+         "title": "Authorisation boundary misconfiguration",
+         "alert": (
+             "Access-control drift alert: investigate a route/role mismatch in the "
+             "synthetic admin service and submit a validated remediation report."
+         ),
+         "ground_truth_id": "GT-AUTHZ-001",
+         "finding_type": "authz_boundary_misconfiguration",
+         "severity": "critical",
+         "required_evidence": ["EVID-301"],
+         "supporting_evidence": ["EVID-302"],
+         "remediation_keywords": ["least privilege", "policy", "regression"],
+         "impact_keywords": ["authorization", "authorisation", "role", "admin"],
+         "assets": {
+             "services": ["gateway", "profile-service", "admin-service"],
+             "routes": ["/login", "/profile", "/admin/export"],
+             "collections": ["repo_snapshot", "telemetry", "headers", "dependencies"],
+         },
+         "repo": [
+             {
+                 "evidence_id": "EVID-301",
+                 "source": "repo_snapshot/admin-service/policy_matrix.yaml",
+                 "service_id": "admin-service",
+                 "tags": ["authorization", "role", "policy", "admin export"],
+                 "text": (
+                     "route=/admin/export allowed_roles=[admin, analyst] "
+                     "expected_roles=[admin]"
+                 ),
+             }
+         ],
+         "logs": [
+             {
+                 "evidence_id": "EVID-302",
+                 "service_id": "admin-service",
+                 "tags": ["authorization", "role", "admin export"],
+                 "text": (
+                     "request_id=req-913 route=/admin/export role=analyst "
+                     "decision=allow synthetic boundary-check event"
+                 ),
+             },
+             {
+                 "evidence_id": "EVID-303",
+                 "service_id": "gateway",
+                 "tags": ["benign", "auth"],
+                 "text": "request_id=req-912 route=/profile role=user decision=allow",
+             },
+         ],
+         "headers": {
+             "admin-service": {
+                 "evidence_id": "EVID-304",
+                 "present": ["Strict-Transport-Security", "Content-Security-Policy"],
+                 "missing": [],
+             }
+         },
+         "dependencies": [
+             {
+                 "evidence_id": "EVID-305",
+                 "source": "repo_snapshot/requirements.lock",
+                 "tags": ["dependency", "benign"],
+                 "text": "pyyaml==6.0.2; fastapi==0.115.0",
+             }
+         ],
+     },
+ }
+
+
+ def list_task_ids() -> list[str]:
+     return list(SCENARIOS)
+
+
+ def build_scenario(task_id: str | None, seed: int | None = None) -> dict[str, Any]:
+     """Return a deep-copied scenario with deterministic benign variation."""
+     selected_task_id = task_id if task_id in SCENARIOS else DEFAULT_TASK_ID
+     scenario = deepcopy(SCENARIOS[selected_task_id])
+     scenario["seed"] = seed
+
+     rng = Random(seed if seed is not None else 0)
+     service_alias_sets = [
+         ["gateway", "profile-service", "admin-service"],
+         ["edge-gateway", "user-profile", "admin-service"],
+         ["public-gateway", "profile-api", "backoffice-admin"],
+     ]
+     aliases = service_alias_sets[rng.randrange(len(service_alias_sets))]
+     original_services = scenario["assets"]["services"]
+     alias_map = dict(zip(original_services, aliases, strict=True))
+
+     scenario["service_aliases"] = alias_map
+     scenario["assets"]["services"] = [alias_map.get(s, s) for s in original_services]
+
+     for collection_name in ("repo", "logs"):
+         for item in scenario.get(collection_name, []):
+             service_id = item.get("service_id")
+             if service_id in alias_map:
+                 item["service_id"] = alias_map[service_id]
+
+     scenario["headers"] = {
+         alias_map.get(service_id, service_id): snapshot
+         for service_id, snapshot in scenario.get("headers", {}).items()
+     }
+
+     for entries_name in ("repo", "logs", "dependencies"):
+         rng.shuffle(scenario.get(entries_name, []))
+
+     return scenario
tests/test_environment.py ADDED
@@ -0,0 +1,187 @@
+ from Cyber_analyst.models import CyberAnalystAction
+ from Cyber_analyst.server.Cyber_analyst_environment import CyberAnalystEnvironment
+ from Cyber_analyst.server.graders import (
+     grade_authz_boundary_hard,
+     grade_missing_security_headers_medium,
+     grade_secret_exposure_easy,
+     safe_reward,
+ )
+
+
+ def _run_success_path(task_id, actions):
+     env = CyberAnalystEnvironment()
+     obs = env.reset(task_id=task_id, seed=7)
+     assert obs.task_id == task_id
+
+     for action in actions:
+         obs = env.step(action)
+
+     assert obs.done is True
+     assert obs.tool_result["score"] > 0.5
+     assert 0.01 <= obs.tool_result["score"] <= 0.99
+     assert obs.error == ""
+     return obs
+
+
+ def test_secret_exposure_success_path():
+     report = {
+         "findings": [
+             {
+                 "finding_type": "secret_exposure",
+                 "evidence_ids": ["EVID-101"],
+                 "impact": "A synthetic API key secret is exposed in config.",
+                 "remediation": "Remove the key and rotate the credential.",
+             }
+         ]
+     }
+     obs = _run_success_path(
+         "secret_exposure_easy",
+         [
+             CyberAnalystAction(tool_name="search_repo", args={"query": "api key"}),
+             CyberAnalystAction(
+                 tool_name="create_finding",
+                 args={
+                     "finding_type": "secret_exposure",
+                     "evidence_ids": ["EVID-101"],
+                     "severity_guess": "high",
+                     "remediation": "Remove and rotate the synthetic credential.",
+                 },
+             ),
+             CyberAnalystAction(tool_name="validate_finding", args={"finding_id": "FND-001"}),
+             CyberAnalystAction(tool_name="submit_report", args={"report_json": report}),
+         ],
+     )
+     assert obs.verified_findings[0]["matching_gt_id"] == "GT-SECRET-001"
+     assert "trajectory_jsonl" in obs.tool_result
+     assert "search_repo" in obs.tool_result["trajectory_jsonl"]
+
+
+ def test_missing_security_headers_success_path():
+     report = {
+         "findings": [
+             {
+                 "finding_type": "missing_security_headers",
+                 "evidence_ids": ["EVID-201"],
+                 "impact": "The gateway is missing HSTS and CSP headers.",
+                 "remediation": "Add HSTS and CSP at the gateway.",
+             }
+         ]
+     }
+     obs = _run_success_path(
+         "missing_security_headers_medium",
+         [
+             CyberAnalystAction(
+                 tool_name="check_security_headers", args={"service_id": "gateway"}
+             ),
+             CyberAnalystAction(
+                 tool_name="create_finding",
+                 args={
+                     "finding_type": "missing_security_headers",
+                     "evidence_ids": ["EVID-201"],
+                     "severity_guess": "medium",
+                     "remediation": "Add HSTS and CSP headers.",
+                 },
+             ),
+             CyberAnalystAction(tool_name="validate_finding", args={"finding_id": "FND-001"}),
+             CyberAnalystAction(tool_name="submit_report", args={"report_json": report}),
+         ],
+     )
+     assert obs.score_breakdown["valid_evidence"] == 0.15
+
+
+ def test_authz_boundary_success_path_with_alias_compatible_service_ids():
+     report = {
+         "findings": [
+             {
+                 "finding_type": "authz_boundary_misconfiguration",
+                 "evidence_ids": ["EVID-301", "EVID-302"],
+                 "impact": "The admin route authorization policy allows an analyst role.",
+                 "remediation": "Apply least privilege in the policy and add a regression test.",
+             }
+         ]
+     }
+     obs = _run_success_path(
+         "authz_boundary_hard",
+         [
+             CyberAnalystAction(tool_name="list_assets", args={}),
+             CyberAnalystAction(
+                 tool_name="get_log_events",
+                 args={"service_id": "admin-service", "query": "admin export"},
+             ),
+             CyberAnalystAction(tool_name="search_repo", args={"query": "admin export"}),
+             CyberAnalystAction(
+                 tool_name="create_finding",
+                 args={
+                     "finding_type": "authz_boundary_misconfiguration",
+                     "evidence_ids": ["EVID-301", "EVID-302"],
+                     "severity_guess": "critical",
+                     "remediation": "Apply least privilege and add a regression test.",
+                 },
+             ),
+             CyberAnalystAction(tool_name="validate_finding", args={"finding_id": "FND-001"}),
+             CyberAnalystAction(tool_name="submit_report", args={"report_json": report}),
+         ],
+     )
+     assert obs.score_breakdown["actionable_remediation"] == 0.15
+
+
+ def test_invalid_tool_returns_observation_error():
+     env = CyberAnalystEnvironment()
+     env.reset(task_id="secret_exposure_easy", seed=1)
+     obs = env.step(CyberAnalystAction(tool_name="shell", args={"cmd": "whoami"}))
+     assert obs.done is False
+     assert obs.error == "unsupported_tool"
+     assert obs.tool_result["ok"] is False
+
+
+ def test_hallucinated_report_scores_low_but_in_range():
+     env = CyberAnalystEnvironment()
+     env.reset(task_id="secret_exposure_easy", seed=1)
+     obs = env.step(
+         CyberAnalystAction(
+             tool_name="submit_report",
+             args={
+                 "report_json": {
+                     "findings": [
+                         {
+                             "finding_type": "remote_code_execution",
+                             "evidence_ids": [],
+                             "impact": "Unsupported claim.",
+                             "remediation": "Unsupported remediation.",
+                         }
+                     ]
+                 }
+             },
+         )
+     )
+     assert obs.done is True
+     assert obs.tool_result["score"] == 0.01
+
+
+ def test_repeated_action_hard_stops_episode():
+     env = CyberAnalystEnvironment()
+     env.reset(task_id="secret_exposure_easy", seed=1)
+     obs = None
+     for _ in range(6):
+         obs = env.step(CyberAnalystAction(tool_name="list_assets", args={}))
+     assert obs is not None
+     assert obs.done is True
+     assert obs.error == "repeat_hard_stop"
+
+
+ def test_seed_determinism_for_assets():
+     env_one = CyberAnalystEnvironment()
+     env_two = CyberAnalystEnvironment()
+     env_one.reset(task_id="authz_boundary_hard", seed=22)
+     env_two.reset(task_id="authz_boundary_hard", seed=22)
+     obs_one = env_one.step(CyberAnalystAction(tool_name="list_assets", args={}))
+     obs_two = env_two.step(CyberAnalystAction(tool_name="list_assets", args={}))
+     assert obs_one.tool_result == obs_two.tool_result
+
+
+ def test_grader_adapters_and_clamp_are_strictly_in_range():
+     assert safe_reward(-1) == 0.01
+     assert safe_reward(2) == 0.99
+     assert 0.01 <= grade_secret_exposure_easy() <= 0.99
+     assert 0.01 <= grade_missing_security_headers_medium() <= 0.99
+     assert 0.01 <= grade_authz_boundary_hard() <= 0.99
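The definition of `safe_reward` falls outside this diff chunk, but the assertions above pin its behaviour at the boundaries. A clamp consistent with those tests would look like the following; this is a sketch inferred from the test expectations, not the module's actual implementation, which may differ:

```python
def safe_reward(score: float) -> float:
    # Clamp a raw score into the band [0.01, 0.99] that the tests require,
    # rounding first so near-boundary floats behave predictably.
    return max(0.01, min(0.99, round(float(score), 4)))


assert safe_reward(-1) == 0.01   # floor, matching the test suite
assert safe_reward(2) == 0.99    # ceiling, matching the test suite
assert safe_reward(0.5) == 0.5   # in-range scores pass through
```

Keeping rewards strictly inside (0, 1) avoids degenerate 0/1 signals when the scores feed a training loop.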
uv.lock ADDED
The diff for this file is too large to render. See raw diff