---
title: Cyber Analyst Environment Server
emoji: 🎯
colorFrom: pink
colorTo: red
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---
# Cyber Analyst Environment
Cyber Analyst is an OpenEnv implementation of the "SecOps Evidence Gym". It benchmarks a bounded, safe security-triage workflow: investigate synthetic artifacts, cite evidence IDs, validate candidate findings with deterministic verifiers, and submit a remediation report.
The environment contains no live targets, no real secrets, no exploit workflow, no shell, and no outbound investigation tools. All evidence is static synthetic lab data.
## Motivation
Frontier models are becoming much stronger at security-relevant reasoning. Anthropic's April 7, 2026 report, *Assessing Claude Mythos Preview's cybersecurity capabilities*, describes a model that can identify and exploit subtle vulnerabilities across real software targets, and argues that the same capability jump should be directed toward defense.
That creates a practical gap: many modern applications are built quickly, including "vibe coded" apps whose security review may not keep pace with generation speed. This environment is a small, safe training and evaluation surface for the defensive side of that gap. The goal is to help train and benchmark smaller, more accessible models to behave like careful application-security analysts: gather evidence, avoid unsupported claims, validate findings, and recommend concrete fixes.
## Environment Description
Each episode simulates a synthetic microservice organization with three services:
- `gateway`
- `profile-service`
- `admin-service`
The agent starts from an alert and can inspect only closed-world artifact collections:
- `repo_snapshot`: static code/config snippets
- `telemetry`: sanitized log events
- `headers`: static response-header snapshots
- `dependencies`: static dependency manifest excerpts
The episode budget is 12 steps. Seeds deterministically vary benign details such as service aliases and evidence ordering while keeping the same task ground truth reproducible.
## Tasks
The manifest ships three graded tasks:
| Task id | Difficulty | Task description | Expected difficulty |
|---|---|---|---|
| `secret_exposure_easy` | easy | Find a synthetic API-key-like secret in a repo snapshot and propose removal plus rotation. | Easiest path: one focused `search_repo` call can surface the relevant evidence, then the agent must create, validate, and report the finding. |
| `missing_security_headers_medium` | medium | Detect missing HSTS/CSP headers in a synthetic gateway header snapshot. | Requires choosing the purpose-built `check_security_headers` tool and mapping missing headers to remediation instead of over-searching unrelated artifacts. |
| `authz_boundary_hard` | hard | Detect an admin route role-policy mismatch without exploitation. | Requires correlating route/role policy evidence with a supporting log event and recommending least-privilege policy remediation plus regression testing. |
## Action Space
Each step accepts exactly one bounded simulator tool call:

```python
CyberAnalystAction(
    tool_name="search_repo",
    args={"query": "api key"},
)
```
Approved tools:
| Tool | Arguments | Purpose |
|---|---|---|
| `list_assets` | `{}` | List synthetic services, routes, and artifact collections. |
| `get_log_events` | `{"service_id": "str", "query": "str"}` | Return sanitized telemetry evidence IDs for a service/query. |
| `check_security_headers` | `{"service_id": "str"}` | Inspect a service header snapshot and return pass/fail evidence. |
| `search_repo` | `{"query": "str"}` | Search synthetic repo/config snippets for evidence IDs. |
| `scan_dependencies` | `{}` | Inspect a synthetic dependency manifest excerpt. |
| `create_finding` | `{"finding_type": "str", "evidence_ids": ["str"], "severity_guess": "str", "remediation": "str"}` | Store a candidate finding for verifier review. |
| `validate_finding` | `{"finding_id": "str"}` | Run the deterministic verifier for a candidate finding. |
| `submit_report` | `{"report_json": {"findings": [...]}}` | Submit the final structured report and end the episode. |
Unsupported tools return an observation error instead of running arbitrary commands. Repeating the exact same action is penalized, and six repeated identical actions hard-stop the episode.
## Observation Space
Each observation is a `CyberAnalystObservation` with:

| Field | Definition |
|---|---|
| `task_id` | Current benchmark task ID. |
| `alert` | Initial alert or task prompt. |
| `phase` | Current episode phase, usually `investigate` or `done`. |
| `tool_catalog` | Approved tool list and argument schemas. |
| `tool_result` | Result returned by the latest tool call. |
| `evidence_ids` | Evidence IDs discovered so far. |
| `candidate_findings` | Candidate findings created by the agent. |
| `verified_findings` | Verifier-confirmed findings. |
| `step_budget_remaining` | Steps remaining before timeout. |
| `score_breakdown` | Deterministic final scoring explanation after report submission. |
| `error` | Non-fatal environment error, if any. |
| `done` | Whether the episode has ended. |
| `reward` | Step reward clamped to the validator-compatible range. |
`submit_report` also returns `trajectory_jsonl`, a JSONL export of the episode events up to report submission. This is intended for offline inspection and future training-data extraction.
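The fields above can be read straight off each step result. Below is a minimal progress-line sketch that assumes only the attribute names in the table; the `summarize` helper itself is made up here, not part of the environment:

```python
def summarize(obs) -> str:
    """Sketch: one-line progress summary of a CyberAnalystObservation-like
    object, using only the documented field names."""
    return (
        f"[{obs.phase}] task={obs.task_id} "
        f"evidence={len(obs.evidence_ids)} "
        f"verified={len(obs.verified_findings)} "
        f"budget={obs.step_budget_remaining}"
    )
```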
## Scoring
Final reports are scored deterministically:

- base score: 0.05
- verified correct finding with matching report impact: +0.60
- valid evidence ID in the report: +0.15
- actionable remediation keywords: +0.15
- hallucinated or unverified finding claims: -0.40 each
- submitting without verifier validation: -0.20
Rewards and final scores are clamped to 0.01..0.99 for validator compatibility.
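For intuition, the rules above can be sketched as a small helper. This is a hypothetical reimplementation, not the server-side grader; `score_report` and its argument names are invented here:

```python
def score_report(verified_findings: int,
                 evidence_cited: bool,
                 remediation_keywords: bool,
                 hallucinated_findings: int = 0,
                 skipped_validation: bool = False) -> float:
    """Sketch of the documented scoring rules (the real grader is server-side)."""
    score = 0.05                        # base score
    score += 0.60 * verified_findings   # verified finding with matching impact
    if evidence_cited:
        score += 0.15                   # valid evidence ID in the report
    if remediation_keywords:
        score += 0.15                   # actionable remediation keywords
    score -= 0.40 * hallucinated_findings
    if skipped_validation:
        score -= 0.20
    return min(0.99, max(0.01, score))  # validator-compatible clamp
```

One verified finding with cited evidence and remediation keywords sums to 0.95 under these rules, which lines up with the oracle final scores reported in the baseline table.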
## Baseline Scores
The current deterministic oracle rollout follows the intended evidence -> finding -> validation -> report path for each task. These scores were measured locally against the environment with `seed=7`.
| Task id | Baseline type | Steps | Final score | Step rewards |
|---|---|---|---|---|
| `secret_exposure_easy` | deterministic oracle | 4 | 0.95 | 0.05, 0.06, 0.11, 0.98 |
| `missing_security_headers_medium` | deterministic oracle | 4 | 0.95 | 0.05, 0.06, 0.11, 0.98 |
| `authz_boundary_hard` | deterministic oracle | 6 | 0.95 | 0.03, 0.05, 0.05, 0.06, 0.11, 0.98 |
A hallucinated one-step report scores 0.01; repeated identical actions hard-stop at a low score.
## Setup
From this directory, install dependencies:

```shell
uv sync
```

Run the local server:

```shell
uv run server
```

Health check:

```shell
curl http://localhost:8000/health
```
Then connect with the client:
```python
from Cyber_analyst import CyberAnalystAction, CyberAnalystEnv

with CyberAnalystEnv(base_url="http://localhost:8000").sync() as env:
    result = env.reset(task_id="secret_exposure_easy", seed=7)
    result = env.step(CyberAnalystAction(tool_name="search_repo", args={"query": "api key"}))
    print(result.observation.tool_result)
```
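The snippet above covers only the first investigation step; a full episode also needs finding creation, validation, and report submission. A hypothetical offline sketch of that four-step oracle path follows — the `finding_type` value, placeholder IDs, and report shape are assumptions, since the real IDs come from tool results at runtime:

```python
def oracle_plan(evidence_id: str = "EV-PLACEHOLDER",
                finding_id: str = "F-PLACEHOLDER") -> list[dict]:
    """Sketch of the evidence -> finding -> validation -> report sequence
    for secret_exposure_easy, as plain action dicts."""
    return [
        {"tool_name": "search_repo",
         "args": {"query": "api key"}},
        {"tool_name": "create_finding",
         "args": {"finding_type": "secret_exposure",   # assumed type name
                  "evidence_ids": [evidence_id],
                  "severity_guess": "high",
                  "remediation": "remove the secret and rotate the key"}},
        {"tool_name": "validate_finding",
         "args": {"finding_id": finding_id}},
        {"tool_name": "submit_report",
         "args": {"report_json": {"findings": [finding_id]}}},
    ]
```

Each dict maps directly onto a `CyberAnalystAction(tool_name=..., args=...)` call in the loop above.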
## Baseline Inference
`inference.py` runs a model-backed baseline over the configured task set and prints strict, parser-friendly logs:

```text
[START] task=<task_id> env=Cyber_analyst model=<model_name>
[STEP] step=<n> action=<compact_json_action> reward=<0.00> done=<true|false> error=<msg|null>
[END] task=<task_id> success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...>
```
The script uses the OpenAI SDK with Hugging Face Inference Providers by default:
```powershell
$env:ENV_URL = "http://localhost:8000"
$env:API_BASE_URL = "https://router.huggingface.co/v1"
$env:MODEL_NAME = "google/gemma-4-31B-it:fastest"
$env:HF_TOKEN = "<your-hugging-face-token>"
python inference.py
```
Use `$env:TASK_NAME = "<task_id>"` to run one task instead of all three.
## Validation
Useful local checks:

```shell
python -m py_compile server/Cyber_analyst_environment.py inference.py
python -m pytest tests
.\.venv\Scripts\openenv.exe validate . --json
```
## Docker
Build the environment image from this directory:

```shell
docker build -t cyber-analyst-env:latest -f server/Dockerfile .
```

Run:

```shell
docker run -p 8000:8000 cyber-analyst-env:latest
```

Health check:

```shell
curl http://localhost:8000/health
```
## Deployment
Deploy to Hugging Face Spaces with OpenEnv:

```shell
openenv push --repo-id <your-hf-username>/Cyber_analyst
```

The deployed Space exposes `/health`, `/docs`, `/ws`, and the optional `/web` interface when web UI support is enabled by the OpenEnv runtime.
## Adding Scenarios
Add new safe scenarios in `server/tasks.py` by extending `SCENARIOS` with:

- a stable `task_id`
- synthetic `assets`, `repo`, `logs`, `headers`, and `dependencies` entries
- `ground_truth_id`, `finding_type`, `required_evidence`, `impact_keywords`, and `remediation_keywords`
Then add a grader adapter in `server/graders.py` and a matching `tasks` entry in `openenv.yaml`. Keep all artifacts synthetic, keep correctness deterministic, and avoid adding real targets or arbitrary execution tools.
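A new entry might be shaped roughly like the following. Every value here is illustrative and the exact schema is whatever `server/tasks.py` defines; treat this as a hypothetical template, not the real format:

```python
# Hypothetical SCENARIOS entry: all artifacts synthetic, all IDs made up.
NEW_SCENARIO = {
    "task_id": "example_misconfig_easy",      # stable, unique task ID (placeholder)
    "assets": ["gateway"],                    # synthetic services/routes
    "repo": [{"id": "repo-001", "snippet": "DEBUG = True"}],
    "logs": [{"id": "log-001", "event": "debug endpoint accessed"}],
    "headers": {"gateway": {"Strict-Transport-Security": None}},
    "dependencies": [{"name": "examplelib", "version": "1.0.0"}],
    "ground_truth_id": "repo-001",            # evidence the verifier expects
    "finding_type": "misconfiguration",
    "required_evidence": ["repo-001"],
    "impact_keywords": ["debug"],
    "remediation_keywords": ["disable", "config"],
}
```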