---
title: Cyber Analyst Environment Server
emoji: 🎯
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---
# Cyber Analyst Environment
Cyber Analyst is an OpenEnv implementation of the "SecOps Evidence Gym". It benchmarks a bounded, safe security-triage workflow: investigate synthetic artifacts, cite evidence IDs, validate candidate findings with deterministic verifiers, and submit a remediation report.
The environment contains no live targets, no real secrets, no exploit workflow, no shell, and no outbound investigation tools. All evidence is static synthetic lab data.
## Motivation
Frontier models are becoming much stronger at security-relevant reasoning. Anthropic's April 7, 2026 report, [Assessing Claude Mythos Preview's cybersecurity capabilities](https://red.anthropic.com/2026/mythos-preview/), describes a model that can identify and exploit subtle vulnerabilities across real software targets, and argues that the same capability jump should be directed toward defense.
That creates a practical gap: many modern applications are built quickly, including "vibe coded" apps whose security review may not keep pace with generation speed. This environment is a small, safe training and evaluation surface for the defensive side of that gap. The goal is to help train and benchmark smaller, more accessible models to behave like careful application-security analysts: gather evidence, avoid unsupported claims, validate findings, and recommend concrete fixes.
## Environment Description
Each episode simulates a synthetic microservice organization with three services:
- `gateway`
- `profile-service`
- `admin-service`
The agent starts from an alert and can inspect only closed-world artifact collections:
- `repo_snapshot`: static code/config snippets
- `telemetry`: sanitized log events
- `headers`: static response-header snapshots
- `dependencies`: static dependency manifest excerpts
The episode budget is 12 steps. Seeds deterministically vary benign details such as service aliases and evidence ordering while leaving each task's ground truth unchanged.
## Tasks
The manifest ships three graded tasks:
| Task id | Difficulty | Task description | Intended solution path |
| --- | --- | --- | --- |
| `secret_exposure_easy` | easy | Find a synthetic API-key-like secret in a repo snapshot and propose removal plus rotation. | Easiest path: one focused `search_repo` call can surface the relevant evidence, then the agent must create, validate, and report the finding. |
| `missing_security_headers_medium` | medium | Detect missing HSTS/CSP headers in a synthetic gateway header snapshot. | Requires choosing the purpose-built `check_security_headers` tool and mapping missing headers to remediation instead of over-searching unrelated artifacts. |
| `authz_boundary_hard` | hard | Detect an admin route role-policy mismatch without exploitation. | Requires correlating route/role policy evidence with a supporting log event and recommending least-privilege policy remediation plus regression testing. |
## Action Space
Each `step` accepts exactly one bounded simulator tool call:
```python
CyberAnalystAction(
tool_name="search_repo",
args={"query": "api key"},
)
```
Approved tools:
| Tool | Arguments | Purpose |
| --- | --- | --- |
| `list_assets` | `{}` | List synthetic services, routes, and artifact collections. |
| `get_log_events` | `{"service_id": "str", "query": "str"}` | Return sanitized telemetry evidence IDs for a service/query. |
| `check_security_headers` | `{"service_id": "str"}` | Inspect a service header snapshot and return pass/fail evidence. |
| `search_repo` | `{"query": "str"}` | Search synthetic repo/config snippets for evidence IDs. |
| `scan_dependencies` | `{}` | Inspect a synthetic dependency manifest excerpt. |
| `create_finding` | `{"finding_type": "str", "evidence_ids": ["str"], "severity_guess": "str", "remediation": "str"}` | Store a candidate finding for verifier review. |
| `validate_finding` | `{"finding_id": "str"}` | Run the deterministic verifier for a candidate finding. |
| `submit_report` | `{"report_json": {"findings": [...]}}` | Submit the final structured report and end the episode. |
Unsupported tools return an observation error instead of running arbitrary commands. Repeating the exact same action is penalized, and six repeated identical actions hard-stop the episode.
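The approved-tool check can also be mirrored client-side to fail fast before a round trip. This is a hypothetical guard, not part of the shipped client; the `APPROVED_TOOLS` set simply transcribes the table above, and the error-dict shape is an assumption about how a rejected call surfaces:

```python
# Hypothetical client-side guard mirroring the server's approved-tool list.
# The environment enforces the same catalog; this only saves a round trip.
APPROVED_TOOLS = {
    "list_assets", "get_log_events", "check_security_headers",
    "search_repo", "scan_dependencies", "create_finding",
    "validate_finding", "submit_report",
}

def guard_action(tool_name: str, args: dict) -> dict:
    """Return an action payload, or an error dict for an unsupported tool."""
    if tool_name not in APPROVED_TOOLS:
        return {"error": f"unsupported tool: {tool_name}"}
    return {"tool_name": tool_name, "args": args}
```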
## Observation Space
Each observation is a `CyberAnalystObservation` with:
| Field | Definition |
| --- | --- |
| `task_id` | Current benchmark task ID. |
| `alert` | Initial alert or task prompt. |
| `phase` | Current episode phase, usually `investigate` or `done`. |
| `tool_catalog` | Approved tool list and argument schemas. |
| `tool_result` | Result returned by the latest tool call. |
| `evidence_ids` | Evidence IDs discovered so far. |
| `candidate_findings` | Candidate findings created by the agent. |
| `verified_findings` | Verifier-confirmed findings. |
| `step_budget_remaining` | Steps remaining before timeout. |
| `score_breakdown` | Deterministic final scoring explanation after report submission. |
| `error` | Non-fatal environment error, if any. |
| `done` | Whether the episode has ended. |
| `reward` | Step reward clamped to the validator-compatible range. |
`submit_report` also returns `trajectory_jsonl`, a JSONL export of the episode events up to report submission. This is intended for offline inspection and future training data extraction.
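For offline inspection, the export can be parsed line by line with the standard library. A minimal sketch, assuming only that each line of `trajectory_jsonl` is a standalone JSON object (the event field names shown in the sample are illustrative, not the environment's actual schema):

```python
import json

def load_trajectory(trajectory_jsonl: str) -> list[dict]:
    """Parse a JSONL export into a list of event dicts, skipping blank lines."""
    return [json.loads(line) for line in trajectory_jsonl.splitlines() if line.strip()]

# Made-up two-event trajectory; field names are illustrative only.
sample = '{"step": 1, "tool_name": "search_repo"}\n{"step": 2, "tool_name": "submit_report"}\n'
events = load_trajectory(sample)
```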
## Scoring
Final reports are scored deterministically:
- base score: `0.05`
- verified correct finding with matching report impact: `+0.60`
- valid evidence ID in the report: `+0.15`
- actionable remediation keywords: `+0.15`
- hallucinated or unverified finding claims: `-0.40` each
- submitting without verifier validation: `-0.20`
Rewards and final scores are clamped to `0.01..0.99` for validator compatibility.
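The rubric above can be recomputed in a few lines. This is a sketch of the arithmetic, not the environment's grader; treating the `+0.60` bonus and `-0.40` penalty as applying once per finding is an assumption:

```python
def score_report(verified_correct: int, valid_evidence: bool,
                 actionable_remediation: bool, hallucinated: int,
                 skipped_validation: bool) -> float:
    """Recompute the scoring rubric: base 0.05, bonuses, per-claim
    penalties, then clamp to the validator-compatible 0.01..0.99 range."""
    score = 0.05
    score += 0.60 * verified_correct
    score += 0.15 if valid_evidence else 0.0
    score += 0.15 if actionable_remediation else 0.0
    score -= 0.40 * hallucinated
    score -= 0.20 if skipped_validation else 0.0
    return min(0.99, max(0.01, score))
```

Under these assumptions, a single verified finding with cited evidence and actionable remediation lands at `0.95`, matching the oracle baselines below, and a hallucinated report with no validation clamps to `0.01`.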
## Baseline Scores
The current deterministic oracle rollout follows the intended evidence -> finding -> validation -> report path for each task. These scores were measured locally against the environment with `seed=7`.
| Task id | Baseline type | Steps | Final score | Step rewards |
| --- | --- | ---: | ---: | --- |
| `secret_exposure_easy` | deterministic oracle | 4 | `0.95` | `0.05, 0.06, 0.11, 0.98` |
| `missing_security_headers_medium` | deterministic oracle | 4 | `0.95` | `0.05, 0.06, 0.11, 0.98` |
| `authz_boundary_hard` | deterministic oracle | 6 | `0.95` | `0.03, 0.05, 0.05, 0.06, 0.11, 0.98` |
A hallucinated one-step report scores `0.01`; repeated identical actions hard-stop at a low score.
## Setup
From this directory, install dependencies:
```bash
uv sync
```
Run the local server:
```bash
uv run server
```
Health check:
```bash
curl http://localhost:8000/health
```
Then connect with the client:
```python
from Cyber_analyst import CyberAnalystAction, CyberAnalystEnv
with CyberAnalystEnv(base_url="http://localhost:8000").sync() as env:
result = env.reset(task_id="secret_exposure_easy", seed=7)
result = env.step(CyberAnalystAction(tool_name="search_repo", args={"query": "api key"}))
print(result.observation.tool_result)
```
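A full episode follows the evidence -> finding -> validation -> report path. The plan below is an illustrative sketch that reuses the imports from the snippet above; the query strings, finding fields, and `<...>` placeholders are assumptions, not the graded answers, and the placeholder IDs would be filled in from earlier observations at runtime:

```python
# Illustrative action sequence for secret_exposure_easy; argument values
# are assumptions, and <...> placeholders come from earlier observations.
EPISODE_PLAN = [
    ("search_repo", {"query": "api key"}),
    ("create_finding", {
        "finding_type": "secret_exposure",
        "evidence_ids": ["<evidence-id-from-search>"],
        "severity_guess": "high",
        "remediation": "Remove the key from the repo and rotate it.",
    }),
    ("validate_finding", {"finding_id": "<finding-id-from-create>"}),
    ("submit_report", {"report_json": {"findings": ["<verified-finding>"]}}),
]

def run_plan(env, plan):
    """Send each planned action in order; stop early if the episode ends."""
    result = None
    for tool_name, args in plan:
        result = env.step(CyberAnalystAction(tool_name=tool_name, args=args))
        if result.observation.done:
            break
    return result
```

Note the four planned steps line up with the four-step oracle baseline for the easy task.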
## Baseline Inference
`inference.py` runs a model-backed baseline over the configured task set and prints strict parser-friendly logs:
```text
[START] task=<task_id> env=Cyber_analyst model=<model_name>
[STEP] step=<n> action=<compact_json_action> reward=<0.00> done=<true|false> error=<msg|null>
[END] task=<task_id> success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...>
```
The script uses the OpenAI SDK with Hugging Face Inference Providers by default:
```powershell
$env:ENV_URL = "http://localhost:8000"
$env:API_BASE_URL = "https://router.huggingface.co/v1"
$env:MODEL_NAME = "google/gemma-4-31B-it:fastest"
$env:HF_TOKEN = "<your-hugging-face-token>"
python inference.py
```
Use `$env:TASK_NAME = "<task_id>"` to run one task instead of all three.
## Validation
Useful local checks:
```bash
python -m py_compile server/Cyber_analyst_environment.py inference.py
python -m pytest tests
.\.venv\Scripts\openenv.exe validate . --json
```
## Docker
Build the environment image from this directory:
```bash
docker build -t cyber-analyst-env:latest -f server/Dockerfile .
```
Run:
```bash
docker run -p 8000:8000 cyber-analyst-env:latest
```
Health check:
```bash
curl http://localhost:8000/health
```
## Deployment
Deploy to Hugging Face Spaces with OpenEnv:
```bash
openenv push --repo-id <your-hf-username>/Cyber_analyst
```
The deployed Space exposes `/health`, `/docs`, `/ws`, and the optional `/web` interface when web UI support is enabled by the OpenEnv runtime.
## Adding Scenarios
Add new safe scenarios in `server/tasks.py` by extending `SCENARIOS` with:
- a stable `task_id`
- synthetic `assets`, `repo`, `logs`, `headers`, and `dependencies` entries
- `ground_truth_id`, `finding_type`, `required_evidence`, `impact_keywords`, and `remediation_keywords`
Then add a grader adapter in `server/graders.py` and a matching `tasks` entry in `openenv.yaml`. Keep all artifacts synthetic, keep correctness deterministic, and avoid adding real targets or arbitrary execution tools.
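As a sketch, a new entry might look like the following. Every key name follows the list above, but the value shapes are assumptions about `server/tasks.py`, and the scenario content is an invented placeholder, not shipped data:

```python
# Hypothetical SCENARIOS entry; keys mirror the fields listed above and
# every value is an illustrative placeholder, not shipped content.
EXAMPLE_SCENARIO = {
    "task_id": "open_redirect_medium",
    "assets": [{"service_id": "gateway", "routes": ["/login"]}],
    "repo": [{"evidence_id": "repo-001", "snippet": "redirect(request.args['next'])"}],
    "logs": [{"evidence_id": "log-001", "event": "redirect to external host"}],
    "headers": [],
    "dependencies": [],
    "ground_truth_id": "open_redirect_gateway",
    "finding_type": "open_redirect",
    "required_evidence": ["repo-001", "log-001"],
    "impact_keywords": ["redirect", "phishing"],
    "remediation_keywords": ["allowlist", "validate"],
}
```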