---
title: Cyber Analyst Environment Server
emoji: 🎯
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Cyber Analyst Environment

Cyber Analyst is an OpenEnv implementation of the "SecOps Evidence Gym". It benchmarks a bounded, safe security-triage workflow: investigate synthetic artifacts, cite evidence IDs, validate candidate findings with deterministic verifiers, and submit a remediation report.

The environment contains no live targets, no real secrets, no exploit workflow, no shell, and no outbound investigation tools. All evidence is static synthetic lab data.

## Motivation

Frontier models are becoming much stronger at security-relevant reasoning. Anthropic's April 7, 2026 report, [Assessing Claude Mythos Preview's cybersecurity capabilities](https://red.anthropic.com/2026/mythos-preview/), describes a model that can identify and exploit subtle vulnerabilities across real software targets, and argues that the same capability jump should be directed toward defense.

That creates a practical gap: many modern applications are built quickly, including "vibe coded" apps whose security review may not keep pace with generation speed. This environment is a small, safe training and evaluation surface for the defensive side of that gap. The goal is to help train and benchmark smaller, more accessible models to behave like careful application-security analysts: gather evidence, avoid unsupported claims, validate findings, and recommend concrete fixes.

## Environment Description

Each episode simulates a synthetic microservice organization with three services:

- `gateway`
- `profile-service`
- `admin-service`

The agent starts from an alert and can inspect only closed-world artifact collections:

- `repo_snapshot`: static code/config snippets
- `telemetry`: sanitized log events
- `headers`: static response-header snapshots
- `dependencies`: static dependency manifest excerpts

The episode budget is 12 steps.
Seeds deterministically vary benign details such as service aliases and evidence ordering while keeping the same task ground truth reproducible.

## Tasks

The manifest ships three graded tasks:

| Task id | Difficulty | Task description | Expected approach |
| --- | --- | --- | --- |
| `secret_exposure_easy` | easy | Find a synthetic API-key-like secret in a repo snapshot and propose removal plus rotation. | Easiest path: one focused `search_repo` call can surface the relevant evidence; the agent must then create, validate, and report the finding. |
| `missing_security_headers_medium` | medium | Detect missing HSTS/CSP headers in a synthetic gateway header snapshot. | Requires choosing the purpose-built `check_security_headers` tool and mapping missing headers to remediation instead of over-searching unrelated artifacts. |
| `authz_boundary_hard` | hard | Detect an admin route role-policy mismatch without exploitation. | Requires correlating route/role policy evidence with a supporting log event and recommending least-privilege policy remediation plus regression testing. |

## Action Space

Each `step` accepts exactly one bounded simulator tool call:

```python
CyberAnalystAction(
    tool_name="search_repo",
    args={"query": "api key"},
)
```

Approved tools:

| Tool | Arguments | Purpose |
| --- | --- | --- |
| `list_assets` | `{}` | List synthetic services, routes, and artifact collections. |
| `get_log_events` | `{"service_id": "str", "query": "str"}` | Return sanitized telemetry evidence IDs for a service/query. |
| `check_security_headers` | `{"service_id": "str"}` | Inspect a service header snapshot and return pass/fail evidence. |
| `search_repo` | `{"query": "str"}` | Search synthetic repo/config snippets for evidence IDs. |
| `scan_dependencies` | `{}` | Inspect a synthetic dependency manifest excerpt. |
| `create_finding` | `{"finding_type": "str", "evidence_ids": ["str"], "severity_guess": "str", "remediation": "str"}` | Store a candidate finding for verifier review. |
| `validate_finding` | `{"finding_id": "str"}` | Run the deterministic verifier for a candidate finding. |
| `submit_report` | `{"report_json": {"findings": [...]}}` | Submit the final structured report and end the episode. |

Unsupported tools return an observation error instead of running arbitrary commands. Repeating the exact same action is penalized, and six repeated identical actions hard-stop the episode.

## Observation Space

Each observation is a `CyberAnalystObservation` with:

| Field | Definition |
| --- | --- |
| `task_id` | Current benchmark task ID. |
| `alert` | Initial alert or task prompt. |
| `phase` | Current episode phase, usually `investigate` or `done`. |
| `tool_catalog` | Approved tool list and argument schemas. |
| `tool_result` | Result returned by the latest tool call. |
| `evidence_ids` | Evidence IDs discovered so far. |
| `candidate_findings` | Candidate findings created by the agent. |
| `verified_findings` | Verifier-confirmed findings. |
| `step_budget_remaining` | Steps remaining before timeout. |
| `score_breakdown` | Deterministic final scoring explanation after report submission. |
| `error` | Non-fatal environment error, if any. |
| `done` | Whether the episode has ended. |
| `reward` | Step reward clamped to the validator-compatible range. |

`submit_report` also returns `trajectory_jsonl`, a JSONL export of the episode events up to report submission. This is intended for offline inspection and future training data extraction.
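Because the export is plain JSONL, it can be consumed offline with only the standard library. A minimal sketch follows; the JSONL framing is specified by the environment, but the event field names used in the sample (`step`, `tool_name`, `evidence_ids`, `done`) are our assumption, not a documented schema:

```python
import json

def parse_trajectory(trajectory_jsonl: str) -> list[dict]:
    """Parse a JSONL trajectory export into a list of event dicts, skipping blank lines."""
    return [json.loads(line) for line in trajectory_jsonl.splitlines() if line.strip()]

# Sample with hypothetical event fields -- the real export's schema may differ.
sample = "\n".join([
    '{"step": 1, "tool_name": "search_repo", "evidence_ids": ["EV-1"]}',
    '{"step": 2, "tool_name": "submit_report", "done": true}',
])
events = parse_trajectory(sample)
print(len(events), events[1]["tool_name"])  # -> 2 submit_report
```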
## Scoring

Final reports are scored deterministically:

- base score: `0.05`
- verified correct finding with matching report impact: `+0.60`
- valid evidence ID in the report: `+0.15`
- actionable remediation keywords: `+0.15`
- hallucinated or unverified finding claims: `-0.40` each
- submitting without verifier validation: `-0.20`

Rewards and final scores are clamped to `0.01..0.99` for validator compatibility.

## Baseline Scores

The current deterministic oracle rollout follows the intended evidence -> finding -> validation -> report path for each task. These scores were measured locally against the environment with `seed=7`.

| Task id | Baseline type | Steps | Final score | Step rewards |
| --- | --- | ---: | ---: | --- |
| `secret_exposure_easy` | deterministic oracle | 4 | `0.95` | `0.05, 0.06, 0.11, 0.98` |
| `missing_security_headers_medium` | deterministic oracle | 4 | `0.95` | `0.05, 0.06, 0.11, 0.98` |
| `authz_boundary_hard` | deterministic oracle | 6 | `0.95` | `0.03, 0.05, 0.05, 0.06, 0.11, 0.98` |

A hallucinated one-step report scores `0.01`; repeated identical actions hard-stop the episode at a low score.
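The scoring arithmetic can be reproduced offline. The sketch below is ours, not the environment's implementation (function and argument names are invented, and it assumes a single verified finding rather than guessing how multiple findings stack); it shows why the oracle lands on `0.95` and a hallucinated one-step report on `0.01`:

```python
def final_score(
    verified_finding: bool,
    has_valid_evidence_id: bool,
    has_remediation_keywords: bool,
    hallucinated_claims: int,
    skipped_validation: bool,
) -> float:
    """Apply the documented scoring rules, then clamp to 0.01..0.99."""
    score = 0.05  # base score
    score += 0.60 if verified_finding else 0.0          # verified finding w/ matching impact
    score += 0.15 if has_valid_evidence_id else 0.0     # valid evidence ID in the report
    score += 0.15 if has_remediation_keywords else 0.0  # actionable remediation keywords
    score -= 0.40 * hallucinated_claims                 # each hallucinated/unverified claim
    score -= 0.20 if skipped_validation else 0.0        # skipped verifier validation
    return min(0.99, max(0.01, score))

# Oracle-style report: one verified finding, cited evidence, actionable remediation.
print(round(final_score(True, True, True, 0, False), 2))   # -> 0.95
# Hallucinated one-step report: fabricated claim, no validation.
print(round(final_score(False, False, False, 1, True), 2)) # -> 0.01
```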
## Setup

From this directory, install dependencies:

```bash
uv sync
```

Run the local server:

```bash
uv run server
```

Health check:

```bash
curl http://localhost:8000/health
```

Then connect with the client:

```python
from Cyber_analyst import CyberAnalystAction, CyberAnalystEnv

with CyberAnalystEnv(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task_id="secret_exposure_easy", seed=7)
    result = env.step(CyberAnalystAction(tool_name="search_repo", args={"query": "api key"}))
    print(result.observation.tool_result)
```

## Baseline Inference

`inference.py` runs a model-backed baseline over the configured task set and prints strict parser-friendly logs:

```text
[START] task= env=Cyber_analyst model=
[STEP] step= action= reward=<0.00> done= error=
[END] task= success= steps= score=<0.00> rewards=
```

The script uses the OpenAI SDK with Hugging Face Inference Providers by default:

```powershell
$env:ENV_URL = "http://localhost:8000"
$env:API_BASE_URL = "https://router.huggingface.co/v1"
$env:MODEL_NAME = "google/gemma-4-31B-it:fastest"
$env:HF_TOKEN = ""
python inference.py
```

Use `$env:TASK_NAME = ""` to run one task instead of all three.

## Validation

Useful local checks:

```bash
python -m py_compile server/Cyber_analyst_environment.py inference.py
python -m pytest tests
.\.venv\Scripts\openenv.exe validate . --json
```

## Docker

Build the environment image from this directory:

```bash
docker build -t cyber-analyst-env:latest -f server/Dockerfile .
```

Run:

```bash
docker run -p 8000:8000 cyber-analyst-env:latest
```

Health check:

```bash
curl http://localhost:8000/health
```

## Deployment

Deploy to Hugging Face Spaces with OpenEnv:

```bash
openenv push --repo-id /Cyber_analyst
```

The deployed Space exposes `/health`, `/docs`, `/ws`, and the optional `/web` interface when web UI support is enabled by the OpenEnv runtime.
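The `[START]`/`[STEP]`/`[END]` lines emitted by `inference.py` are designed to be parser-friendly. A minimal extraction sketch is below; the template in Baseline Inference elides the actual values, so the filled-in sample lines here are our assumption about what a real run prints:

```python
import re

# Matches the angle-bracketed numeric fields, e.g. "reward=<0.06>" or "score=<0.95>".
SCORE_RE = re.compile(r"(?:reward|score)=<(?P<value>-?\d+\.\d+)>")

def extract_scores(log_text: str) -> list[float]:
    """Collect every reward=<...> / score=<...> value, in order of appearance."""
    return [float(m.group("value")) for m in SCORE_RE.finditer(log_text)]

# Hypothetical filled-in lines following the baseline-inference log template.
log = """\
[START] task=secret_exposure_easy env=Cyber_analyst model=example-model
[STEP] step=1 action=search_repo reward=<0.06> done=False error=None
[END] task=secret_exposure_easy success=True steps=4 score=<0.95> rewards=0.05,0.06,0.11,0.98
"""
print(extract_scores(log))  # -> [0.06, 0.95]
```

Note that the plain `rewards=` list on the `[END]` line is deliberately left alone: only the angle-bracketed fields are captured.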
## Adding Scenarios

Add new safe scenarios in `server/tasks.py` by extending `SCENARIOS` with:

- a stable `task_id`
- synthetic `assets`, `repo`, `logs`, `headers`, and `dependencies` entries
- `ground_truth_id`, `finding_type`, `required_evidence`, `impact_keywords`, and `remediation_keywords`

Then add a grader adapter in `server/graders.py` and a matching `tasks` entry in `openenv.yaml`. Keep all artifacts synthetic, keep correctness deterministic, and avoid adding real targets or arbitrary execution tools.
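A new entry might look like the sketch below. The required keys come from the checklist above, but the value shapes (lists of dicts with `evidence_id` fields, route lists, and so on) are hypothetical; check the existing entries in `server/tasks.py` for the authoritative schema. All artifact content stays synthetic:

```python
# Hypothetical SCENARIOS entry -- key names from the checklist above,
# value shapes assumed; every artifact is synthetic lab data.
new_scenario = {
    "task_id": "debug_endpoint_exposed_easy",
    "assets": [{"service_id": "profile-service", "routes": ["/debug/vars"]}],
    "repo": [{"evidence_id": "repo-dbg-1", "snippet": "DEBUG_ROUTES = True"}],
    "logs": [{"evidence_id": "log-dbg-1", "event": "GET /debug/vars 200"}],
    "headers": [],
    "dependencies": [],
    "ground_truth_id": "repo-dbg-1",
    "finding_type": "debug_endpoint_exposed",
    "required_evidence": ["repo-dbg-1"],
    "impact_keywords": ["information disclosure"],
    "remediation_keywords": ["disable", "remove"],
}

# Sanity checks a grader adapter would rely on: the ground truth must be
# citable evidence, and required evidence must actually exist in the artifacts.
all_evidence = {e["evidence_id"] for e in new_scenario["repo"] + new_scenario["logs"]}
assert new_scenario["ground_truth_id"] in all_evidence
assert set(new_scenario["required_evidence"]) <= all_evidence
```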