---
title: Cyber Analyst Environment Server
emoji: 🎯
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Cyber Analyst Environment

Cyber Analyst is an OpenEnv implementation of the "SecOps Evidence Gym". It benchmarks a bounded, safe security-triage workflow: investigate synthetic artifacts, cite evidence IDs, validate candidate findings with deterministic verifiers, and submit a remediation report.

The environment contains no live targets, no real secrets, no exploit workflow, no shell, and no outbound investigation tools. All evidence is static synthetic lab data.

## Motivation

Frontier models are becoming much stronger at security-relevant reasoning. Anthropic's April 7, 2026 report, [Assessing Claude Mythos Preview's cybersecurity capabilities](https://red.anthropic.com/2026/mythos-preview/), describes a model that can identify and exploit subtle vulnerabilities across real software targets, and argues that the same capability jump should be directed toward defense.

That creates a practical gap: many modern applications are built quickly, including "vibe coded" apps whose security review may not keep pace with generation speed. This environment is a small, safe training and evaluation surface for the defensive side of that gap. The goal is to help train and benchmark smaller, more accessible models to behave like careful application-security analysts: gather evidence, avoid unsupported claims, validate findings, and recommend concrete fixes.

## Environment Description

Each episode simulates a synthetic microservice organization with three services:

- `gateway`
- `profile-service`
- `admin-service`

The agent starts from an alert and can inspect only closed-world artifact collections:

- `repo_snapshot`: static code/config snippets
- `telemetry`: sanitized log events
- `headers`: static response-header snapshots
- `dependencies`: static dependency manifest excerpts

The episode budget is 12 steps. Seeds deterministically vary benign details such as service aliases and evidence ordering, while each task's ground truth stays fixed and reproducible.

## Tasks

The manifest ships three graded tasks:

| Task id | Difficulty | Description | Expected solution path |
| --- | --- | --- | --- |
| `secret_exposure_easy` | easy | Find a synthetic API-key-like secret in a repo snapshot and propose removal plus rotation. | Easiest path: one focused `search_repo` call can surface the relevant evidence, then the agent must create, validate, and report the finding. |
| `missing_security_headers_medium` | medium | Detect missing HSTS/CSP headers in a synthetic gateway header snapshot. | Requires choosing the purpose-built `check_security_headers` tool and mapping missing headers to remediation instead of over-searching unrelated artifacts. |
| `authz_boundary_hard` | hard | Detect an admin route role-policy mismatch without exploitation. | Requires correlating route/role policy evidence with a supporting log event and recommending least-privilege policy remediation plus regression testing. |

## Action Space

Each `step` accepts exactly one bounded simulator tool call:

```python
CyberAnalystAction(
    tool_name="search_repo",
    args={"query": "api key"},
)
```

Approved tools:

| Tool | Arguments | Purpose |
| --- | --- | --- |
| `list_assets` | `{}` | List synthetic services, routes, and artifact collections. |
| `get_log_events` | `{"service_id": "str", "query": "str"}` | Return sanitized telemetry evidence IDs for a service/query. |
| `check_security_headers` | `{"service_id": "str"}` | Inspect a service header snapshot and return pass/fail evidence. |
| `search_repo` | `{"query": "str"}` | Search synthetic repo/config snippets for evidence IDs. |
| `scan_dependencies` | `{}` | Inspect a synthetic dependency manifest excerpt. |
| `create_finding` | `{"finding_type": "str", "evidence_ids": ["str"], "severity_guess": "str", "remediation": "str"}` | Store a candidate finding for verifier review. |
| `validate_finding` | `{"finding_id": "str"}` | Run the deterministic verifier for a candidate finding. |
| `submit_report` | `{"report_json": {"findings": [...]}}` | Submit the final structured report and end the episode. |

Unsupported tools return an observation error instead of running arbitrary commands. Repeating the exact same action is penalized, and six repeated identical actions hard-stop the episode.
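The tool table above implies a canonical call order: gather evidence, create a finding, validate it, then report. As an illustrative sketch only, the payloads below mirror the argument schemas in the table; the concrete evidence and finding IDs (`EV-001`, `F-001`) and all values are invented, and the report's inner field names are assumptions:

```python
# Illustrative action payloads for a minimal secret-exposure episode.
# Argument names follow the tool table above; concrete values are invented.
actions = [
    {"tool_name": "search_repo", "args": {"query": "api key"}},
    {"tool_name": "create_finding", "args": {
        "finding_type": "secret_exposure",
        "evidence_ids": ["EV-001"],  # hypothetical ID surfaced by search_repo
        "severity_guess": "high",
        "remediation": "Remove the key from the repo and rotate the credential.",
    }},
    {"tool_name": "validate_finding", "args": {"finding_id": "F-001"}},  # hypothetical ID
    {"tool_name": "submit_report", "args": {"report_json": {"findings": [
        {"finding_id": "F-001", "evidence_ids": ["EV-001"]},  # report shape is an assumption
    ]}}},
]
```

Note the sequence is four steps, which matches the step count of the easy-task oracle baseline reported below.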

## Observation Space

Each observation is a `CyberAnalystObservation` with:

| Field | Definition |
| --- | --- |
| `task_id` | Current benchmark task ID. |
| `alert` | Initial alert or task prompt. |
| `phase` | Current episode phase, usually `investigate` or `done`. |
| `tool_catalog` | Approved tool list and argument schemas. |
| `tool_result` | Result returned by the latest tool call. |
| `evidence_ids` | Evidence IDs discovered so far. |
| `candidate_findings` | Candidate findings created by the agent. |
| `verified_findings` | Verifier-confirmed findings. |
| `step_budget_remaining` | Steps remaining before timeout. |
| `score_breakdown` | Deterministic final scoring explanation after report submission. |
| `error` | Non-fatal environment error, if any. |
| `done` | Whether the episode has ended. |
| `reward` | Step reward clamped to the validator-compatible range. |

`submit_report` also returns `trajectory_jsonl`, a JSONL export of the episode events up to report submission. This is intended for offline inspection and future training data extraction.
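A driver loop would typically branch on a handful of these fields. As a minimal sketch, assuming the observation is available as a plain dict keyed by the field names in the table above (the branching policy itself is an invented heuristic, not part of the environment):

```python
def next_move(obs: dict) -> str:
    """Suggest the next action class from observation fields (names from the table above)."""
    if obs.get("done"):
        return "stop"
    if obs.get("candidate_findings") and not obs.get("verified_findings"):
        return "validate_finding"      # validate before reporting
    if obs.get("verified_findings"):
        return "submit_report"         # report only verified findings
    if obs.get("step_budget_remaining", 0) <= 2:
        return "submit_report"         # budget nearly exhausted; report what we have
    return "investigate"               # keep gathering evidence
```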

## Scoring

Final reports are scored deterministically:

- base score: `0.05`
- verified correct finding with matching report impact: `+0.60`
- valid evidence ID in the report: `+0.15`
- actionable remediation keywords: `+0.15`
- hallucinated or unverified finding claims: `-0.40` each
- submitting without verifier validation: `-0.20`

Rewards and final scores are clamped to `0.01..0.99` for validator compatibility.
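The rubric above can be replayed as plain arithmetic. The sketch below mirrors the published weights; the real grader runs server-side and may differ in detail:

```python
def score_report(verified_correct: bool, valid_evidence: bool,
                 actionable_remediation: bool, hallucinated_claims: int,
                 skipped_validation: bool) -> float:
    """Recompute the deterministic rubric from the list above."""
    score = 0.05                          # base score
    if verified_correct:
        score += 0.60                     # verified finding with matching impact
    if valid_evidence:
        score += 0.15                     # valid evidence ID cited in the report
    if actionable_remediation:
        score += 0.15                     # actionable remediation keywords present
    score -= 0.40 * hallucinated_claims   # per hallucinated/unverified claim
    if skipped_validation:
        score -= 0.20                     # submitted without verifier validation
    return min(0.99, max(0.01, round(score, 2)))  # clamp to 0.01..0.99
```

An ideal report scores `0.05 + 0.60 + 0.15 + 0.15 = 0.95`, which is the oracle final score in the baseline table below, and a one-step hallucinated report falls through to the `0.01` floor.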

## Baseline Scores

The current deterministic oracle rollout follows the intended evidence -> finding -> validation -> report path for each task. These scores were measured locally against the environment with `seed=7`.

| Task id | Baseline type | Steps | Final score | Step rewards |
| --- | --- | ---: | ---: | --- |
| `secret_exposure_easy` | deterministic oracle | 4 | `0.95` | `0.05, 0.06, 0.11, 0.98` |
| `missing_security_headers_medium` | deterministic oracle | 4 | `0.95` | `0.05, 0.06, 0.11, 0.98` |
| `authz_boundary_hard` | deterministic oracle | 6 | `0.95` | `0.03, 0.05, 0.05, 0.06, 0.11, 0.98` |

A hallucinated one-step report scores `0.01`; repeated identical actions hard-stop at a low score.

## Setup

From this directory, install dependencies:

```bash
uv sync
```

Run the local server:

```bash
uv run server
```

Health check:

```bash
curl http://localhost:8000/health
```

Then connect with the client:

```python
from Cyber_analyst import CyberAnalystAction, CyberAnalystEnv

with CyberAnalystEnv(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task_id="secret_exposure_easy", seed=7)
    result = env.step(CyberAnalystAction(tool_name="search_repo", args={"query": "api key"}))
    print(result.observation.tool_result)
```

## Baseline Inference

`inference.py` runs a model-backed baseline over the configured task set and emits strictly formatted, parser-friendly logs:

```text
[START] task=<task_id> env=Cyber_analyst model=<model_name>
[STEP] step=<n> action=<compact_json_action> reward=<0.00> done=<true|false> error=<msg|null>
[END] task=<task_id> success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...>
```
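Because the format is line-oriented, summary lines can be scraped with a small regex. A sketch, assuming exactly the `[END]` field order shown above:

```python
import re

# Matches the [END] summary line from the log format above.
END_RE = re.compile(
    r"\[END\] task=(?P<task>\S+) success=(?P<success>true|false) "
    r"steps=(?P<steps>\d+) score=(?P<score>[\d.]+) rewards=(?P<rewards>[\d.,]+)"
)

def parse_end_line(line: str) -> dict:
    """Extract the summary fields from an [END] log line."""
    m = END_RE.match(line)
    if m is None:
        raise ValueError(f"not an [END] line: {line!r}")
    return {
        "task": m.group("task"),
        "success": m.group("success") == "true",
        "steps": int(m.group("steps")),
        "score": float(m.group("score")),
        "rewards": [float(r) for r in m.group("rewards").split(",")],
    }
```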

The script uses the OpenAI SDK with Hugging Face Inference Providers by default:

```powershell
$env:ENV_URL = "http://localhost:8000"
$env:API_BASE_URL = "https://router.huggingface.co/v1"
$env:MODEL_NAME = "google/gemma-4-31B-it:fastest"
$env:HF_TOKEN = "<your-hugging-face-token>"
python inference.py
```

Use `$env:TASK_NAME = "<task_id>"` to run one task instead of all three.

## Validation

Useful local checks:

```bash
python -m py_compile server/Cyber_analyst_environment.py inference.py
python -m pytest tests
.\.venv\Scripts\openenv.exe validate . --json
```

## Docker

Build the environment image from this directory:

```bash
docker build -t cyber-analyst-env:latest -f server/Dockerfile .
```

Run:

```bash
docker run -p 8000:8000 cyber-analyst-env:latest
```

Health check:

```bash
curl http://localhost:8000/health
```

## Deployment

Deploy to Hugging Face Spaces with OpenEnv:

```bash
openenv push --repo-id <your-hf-username>/Cyber_analyst
```

The deployed Space exposes `/health`, `/docs`, `/ws`, and the optional `/web` interface when web UI support is enabled by the OpenEnv runtime.

## Adding Scenarios

Add new safe scenarios in `server/tasks.py` by extending `SCENARIOS` with:

- a stable `task_id`
- synthetic `assets`, `repo`, `logs`, `headers`, and `dependencies` entries
- `ground_truth_id`, `finding_type`, `required_evidence`, `impact_keywords`, and `remediation_keywords`

Then add a grader adapter in `server/graders.py` and a matching `tasks` entry in `openenv.yaml`. Keep all artifacts synthetic, keep correctness deterministic, and avoid adding real targets or arbitrary execution tools.
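As an illustrative shape only, a new `SCENARIOS` entry might look like the dict below. The top-level keys come from the bullet list above, but the nesting, the task ID, and every value are invented assumptions about `server/tasks.py`, not its actual schema:

```python
# Hypothetical SCENARIOS entry; all values are synthetic and purely illustrative.
NEW_SCENARIO = {
    "task_id": "debug_endpoint_easy",          # invented stable ID
    "assets": {"services": ["gateway"]},       # synthetic assets only
    "repo": [{"evidence_id": "EV-900", "snippet": "DEBUG_ROUTES = True"}],
    "logs": [],
    "headers": {},
    "dependencies": [],
    "ground_truth_id": "EV-900",
    "finding_type": "debug_endpoint_exposed",
    "required_evidence": ["EV-900"],
    "impact_keywords": ["information disclosure"],
    "remediation_keywords": ["disable", "debug"],
}
```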