---
title: Cyber Analyst Environment Server
emoji: 🎯
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Cyber Analyst Environment

Cyber Analyst is an OpenEnv implementation of the "SecOps Evidence Gym". It benchmarks a bounded, safe security-triage workflow: investigate synthetic artifacts, cite evidence IDs, validate candidate findings with deterministic verifiers, and submit a remediation report.

The environment contains no live targets, no real secrets, no exploit workflow, no shell, and no outbound investigation tools. All evidence is static synthetic lab data.

## Motivation

Frontier models are becoming much stronger at security-relevant reasoning. Anthropic's April 7, 2026 report, [Assessing Claude Mythos Preview's cybersecurity capabilities](https://red.anthropic.com/2026/mythos-preview/), describes a model that can identify and exploit subtle vulnerabilities across real software targets, and argues that the same capability jump should be directed toward defense.

That creates a practical gap: many modern applications are built quickly, including "vibe coded" apps whose security review may not keep pace with generation speed. This environment is a small, safe training and evaluation surface for the defensive side of that gap. The goal is to help train and benchmark smaller, more accessible models to behave like careful application-security analysts: gather evidence, avoid unsupported claims, validate findings, and recommend concrete fixes.

## Environment Description

Each episode simulates a synthetic microservice organization with three services:

- `gateway`
- `profile-service`
- `admin-service`

The agent starts from an alert and can inspect only closed-world artifact collections:

- `repo_snapshot`: static code/config snippets
- `telemetry`: sanitized log events
- `headers`: static response-header snapshots
- `dependencies`: static dependency manifest excerpts

The episode budget is 12 steps. Seeds deterministically vary benign details such as service aliases and evidence ordering, while each task's ground truth stays fixed and reproducible.

## Tasks

The manifest ships three graded tasks:

| Task id | Difficulty | Description | Expected solution path |
| --- | --- | --- | --- |
| `secret_exposure_easy` | easy | Find a synthetic API-key-like secret in a repo snapshot and propose removal plus rotation. | Easiest path: one focused `search_repo` call can surface the relevant evidence, then the agent must create, validate, and report the finding. |
| `missing_security_headers_medium` | medium | Detect missing HSTS/CSP headers in a synthetic gateway header snapshot. | Requires choosing the purpose-built `check_security_headers` tool and mapping missing headers to remediation instead of over-searching unrelated artifacts. |
| `authz_boundary_hard` | hard | Detect an admin route role-policy mismatch without exploitation. | Requires correlating route/role policy evidence with a supporting log event and recommending least-privilege policy remediation plus regression testing. |

## Action Space

Each `step` accepts exactly one bounded simulator tool call:

```python
CyberAnalystAction(
    tool_name="search_repo",
    args={"query": "api key"},
)
```

Approved tools:

| Tool | Arguments | Purpose |
| --- | --- | --- |
| `list_assets` | `{}` | List synthetic services, routes, and artifact collections. |
| `get_log_events` | `{"service_id": "str", "query": "str"}` | Return sanitized telemetry evidence IDs for a service/query. |
| `check_security_headers` | `{"service_id": "str"}` | Inspect a service header snapshot and return pass/fail evidence. |
| `search_repo` | `{"query": "str"}` | Search synthetic repo/config snippets for evidence IDs. |
| `scan_dependencies` | `{}` | Inspect a synthetic dependency manifest excerpt. |
| `create_finding` | `{"finding_type": "str", "evidence_ids": ["str"], "severity_guess": "str", "remediation": "str"}` | Store a candidate finding for verifier review. |
| `validate_finding` | `{"finding_id": "str"}` | Run the deterministic verifier for a candidate finding. |
| `submit_report` | `{"report_json": {"findings": [...]}}` | Submit the final structured report and end the episode. |

Unsupported tools return an observation error instead of running arbitrary commands. Repeating the exact same action is penalized, and six repeated identical actions hard-stop the episode.
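The tool table above implies a canonical call order: gather evidence, create a finding, validate it, then report. As an illustrative sketch only, the payloads below mirror the argument schemas in the table; the concrete evidence and finding IDs (`EV-001`, `F-001`) and all values are invented, and the report's inner field names are assumptions:

```python
# Illustrative action payloads for a minimal secret-exposure episode.
# Argument names follow the tool table above; concrete values are invented.
actions = [
    {"tool_name": "search_repo", "args": {"query": "api key"}},
    {"tool_name": "create_finding", "args": {
        "finding_type": "secret_exposure",
        "evidence_ids": ["EV-001"],  # hypothetical ID surfaced by search_repo
        "severity_guess": "high",
        "remediation": "Remove the key from the repo and rotate the credential.",
    }},
    {"tool_name": "validate_finding", "args": {"finding_id": "F-001"}},  # hypothetical ID
    {"tool_name": "submit_report", "args": {"report_json": {"findings": [
        {"finding_id": "F-001", "evidence_ids": ["EV-001"]},  # report shape is an assumption
    ]}}},
]
```

Note the sequence is four steps, which matches the step count of the easy-task oracle baseline reported below.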

## Observation Space

Each observation is a `CyberAnalystObservation` with:

| Field | Definition |
| --- | --- |
| `task_id` | Current benchmark task ID. |
| `alert` | Initial alert or task prompt. |
| `phase` | Current episode phase, usually `investigate` or `done`. |
| `tool_catalog` | Approved tool list and argument schemas. |
| `tool_result` | Result returned by the latest tool call. |
| `evidence_ids` | Evidence IDs discovered so far. |
| `candidate_findings` | Candidate findings created by the agent. |
| `verified_findings` | Verifier-confirmed findings. |
| `step_budget_remaining` | Steps remaining before timeout. |
| `score_breakdown` | Deterministic final scoring explanation after report submission. |
| `error` | Non-fatal environment error, if any. |
| `done` | Whether the episode has ended. |
| `reward` | Step reward clamped to the validator-compatible range. |

`submit_report` also returns `trajectory_jsonl`, a JSONL export of the episode events up to report submission. This is intended for offline inspection and future training data extraction.
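A driver loop would typically branch on a handful of these fields. As a minimal sketch, assuming the observation is available as a plain dict keyed by the field names in the table above (the branching policy itself is an invented heuristic, not part of the environment):

```python
def next_move(obs: dict) -> str:
    """Suggest the next action class from observation fields (names from the table above)."""
    if obs.get("done"):
        return "stop"
    if obs.get("candidate_findings") and not obs.get("verified_findings"):
        return "validate_finding"      # validate before reporting
    if obs.get("verified_findings"):
        return "submit_report"         # report only verified findings
    if obs.get("step_budget_remaining", 0) <= 2:
        return "submit_report"         # budget nearly exhausted; report what we have
    return "investigate"               # keep gathering evidence
```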

## Scoring

Final reports are scored deterministically:

- base score: `0.05`
- verified correct finding with matching report impact: `+0.60`
- valid evidence ID in the report: `+0.15`
- actionable remediation keywords: `+0.15`
- hallucinated or unverified finding claims: `-0.40` each
- submitting without verifier validation: `-0.20`

Rewards and final scores are clamped to `0.01..0.99` for validator compatibility.
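The rubric above can be replayed as plain arithmetic. The sketch below mirrors the published weights; the real grader runs server-side and may differ in detail:

```python
def score_report(verified_correct: bool, valid_evidence: bool,
                 actionable_remediation: bool, hallucinated_claims: int,
                 skipped_validation: bool) -> float:
    """Recompute the deterministic rubric from the list above."""
    score = 0.05                          # base score
    if verified_correct:
        score += 0.60                     # verified finding with matching impact
    if valid_evidence:
        score += 0.15                     # valid evidence ID cited in the report
    if actionable_remediation:
        score += 0.15                     # actionable remediation keywords present
    score -= 0.40 * hallucinated_claims   # per hallucinated/unverified claim
    if skipped_validation:
        score -= 0.20                     # submitted without verifier validation
    return min(0.99, max(0.01, round(score, 2)))  # clamp to 0.01..0.99
```

An ideal report scores `0.05 + 0.60 + 0.15 + 0.15 = 0.95`, which is the oracle final score in the baseline table below, and a one-step hallucinated report falls through to the `0.01` floor.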

## Baseline Scores

The current deterministic oracle rollout follows the intended evidence -> finding -> validation -> report path for each task. These scores were measured locally against the environment with `seed=7`.

| Task id | Baseline type | Steps | Final score | Step rewards |
| --- | --- | ---: | ---: | --- |
| `secret_exposure_easy` | deterministic oracle | 4 | `0.95` | `0.05, 0.06, 0.11, 0.98` |
| `missing_security_headers_medium` | deterministic oracle | 4 | `0.95` | `0.05, 0.06, 0.11, 0.98` |
| `authz_boundary_hard` | deterministic oracle | 6 | `0.95` | `0.03, 0.05, 0.05, 0.06, 0.11, 0.98` |

A hallucinated one-step report scores `0.01`; repeated identical actions hard-stop at a low score.

## Setup

From this directory, install dependencies:

```bash
uv sync
```

Run the local server:

```bash
uv run server
```

Health check:

```bash
curl http://localhost:8000/health
```

Then connect with the client:

```python
from Cyber_analyst import CyberAnalystAction, CyberAnalystEnv

with CyberAnalystEnv(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task_id="secret_exposure_easy", seed=7)
    result = env.step(CyberAnalystAction(tool_name="search_repo", args={"query": "api key"}))
    print(result.observation.tool_result)
```

## Baseline Inference

`inference.py` runs a model-backed baseline over the configured task set and emits strictly formatted, parser-friendly logs:

```text
[START] task=<task_id> env=Cyber_analyst model=<model_name>
[STEP] step=<n> action=<compact_json_action> reward=<0.00> done=<true|false> error=<msg|null>
[END] task=<task_id> success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...>
```
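Because the format is line-oriented, summary lines can be scraped with a small regex. A sketch, assuming exactly the `[END]` field order shown above:

```python
import re

# Matches the [END] summary line from the log format above.
END_RE = re.compile(
    r"\[END\] task=(?P<task>\S+) success=(?P<success>true|false) "
    r"steps=(?P<steps>\d+) score=(?P<score>[\d.]+) rewards=(?P<rewards>[\d.,]+)"
)

def parse_end_line(line: str) -> dict:
    """Extract the summary fields from an [END] log line."""
    m = END_RE.match(line)
    if m is None:
        raise ValueError(f"not an [END] line: {line!r}")
    return {
        "task": m.group("task"),
        "success": m.group("success") == "true",
        "steps": int(m.group("steps")),
        "score": float(m.group("score")),
        "rewards": [float(r) for r in m.group("rewards").split(",")],
    }
```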

The script uses the OpenAI SDK with Hugging Face Inference Providers by default:

```powershell
$env:ENV_URL = "http://localhost:8000"
$env:API_BASE_URL = "https://router.huggingface.co/v1"
$env:MODEL_NAME = "google/gemma-4-31B-it:fastest"
$env:HF_TOKEN = "<your-hugging-face-token>"
python inference.py
```

Use `$env:TASK_NAME = "<task_id>"` to run one task instead of all three.

## Validation

Useful local checks:

```bash
python -m py_compile server/Cyber_analyst_environment.py inference.py
python -m pytest tests
.\.venv\Scripts\openenv.exe validate . --json
```

## Docker

Build the environment image from this directory:

```bash
docker build -t cyber-analyst-env:latest -f server/Dockerfile .
```

Run:

```bash
docker run -p 8000:8000 cyber-analyst-env:latest
```

Health check:

```bash
curl http://localhost:8000/health
```

## Deployment

Deploy to Hugging Face Spaces with OpenEnv:

```bash
openenv push --repo-id <your-hf-username>/Cyber_analyst
```

The deployed Space exposes `/health`, `/docs`, `/ws`, and the optional `/web` interface when web UI support is enabled by the OpenEnv runtime.

## Adding Scenarios

Add new safe scenarios in `server/tasks.py` by extending `SCENARIOS` with:

- a stable `task_id`
- synthetic `assets`, `repo`, `logs`, `headers`, and `dependencies` entries
- `ground_truth_id`, `finding_type`, `required_evidence`, `impact_keywords`, and `remediation_keywords`

Then add a grader adapter in `server/graders.py` and a matching `tasks` entry in `openenv.yaml`. Keep all artifacts synthetic, keep correctness deterministic, and avoid adding real targets or arbitrary execution tools.
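As an illustrative shape only, a new `SCENARIOS` entry might look like the dict below. The top-level keys come from the bullet list above, but the nesting, the task ID, and every value are invented assumptions about `server/tasks.py`, not its actual schema:

```python
# Hypothetical SCENARIOS entry; all values are synthetic and purely illustrative.
NEW_SCENARIO = {
    "task_id": "debug_endpoint_easy",          # invented stable ID
    "assets": {"services": ["gateway"]},       # synthetic assets only
    "repo": [{"evidence_id": "EV-900", "snippet": "DEBUG_ROUTES = True"}],
    "logs": [],
    "headers": {},
    "dependencies": [],
    "ground_truth_id": "EV-900",
    "finding_type": "debug_endpoint_exposed",
    "required_evidence": ["EV-900"],
    "impact_keywords": ["information disclosure"],
    "remediation_keywords": ["disable", "debug"],
}
```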