Spaces:

Drac0528
/

CodeSecure


Drac0528 committed on Apr 5

Commit 8c391c7 (verified) · 1 Parent(s): 3c2807e

Upload 35 files


Files changed (35)

Dockerfile +23 -0
README.md +200 -0
VERIFICATION_REPORT.md +35 -0
__init__.py +21 -0
client.py +51 -0
inference.py +220 -0
models.py +90 -0
openenv.yaml +6 -0
pyproject.toml +34 -0
server/Dockerfile +49 -0
server/__init__.py +1 -0
server/__pycache__/__init__.cpython-312.pyc +0 -0
server/__pycache__/__init__.cpython-314.pyc +0 -0
server/__pycache__/app.cpython-314.pyc +0 -0
server/__pycache__/grader.cpython-312.pyc +0 -0
server/__pycache__/grader.cpython-314.pyc +0 -0
server/__pycache__/security_environment.cpython-312.pyc +0 -0
server/__pycache__/security_environment.cpython-314.pyc +0 -0
server/__pycache__/tasks.cpython-312.pyc +0 -0
server/__pycache__/tasks.cpython-314.pyc +0 -0
server/app.py +33 -0
server/grader.py +181 -0
server/security_environment.py +386 -0
server/tasks.py +208 -0
tests/__pycache__/conftest.cpython-312-pytest-7.4.4.pyc +0 -0
tests/__pycache__/conftest.cpython-314-pytest-9.0.2.pyc +0 -0
tests/__pycache__/test_behavioral_scenarios.cpython-312-pytest-7.4.4.pyc +0 -0
tests/__pycache__/test_grader_and_env.cpython-312-pytest-7.4.4.pyc +0 -0
tests/__pycache__/test_grader_and_env.cpython-314-pytest-9.0.2.pyc +0 -0
tests/__pycache__/test_grader_and_env.cpython-314.pyc +0 -0
tests/conftest.py +10 -0
tests/test_behavioral_scenarios.py +476 -0
tests/test_grader_and_env.py +63 -0
uv.lock +0 -0
validate-submission.sh +145 -0

Dockerfile ADDED

	@@ -0,0 +1,23 @@

+FROM python:3.11-slim
+WORKDIR /app
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    curl \
+    ca-certificates \
+    && rm -rf /var/lib/apt/lists/*
+COPY . /app
+RUN pip install --no-cache-dir "openenv-core[core]>=0.2.2" && \
+    pip install --no-cache-dir .
+ENV PYTHONUNBUFFERED=1
+ENV ENABLE_WEB_INTERFACE=true
+EXPOSE 8000
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD curl -f http://localhost:8000/health || exit 1
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]

README.md ADDED

	@@ -0,0 +1,200 @@

+---
+title: Code Security Auditor Environment
+emoji: "🛡️"
+colorFrom: yellow
+colorTo: red
+sdk: docker
+pinned: false
+app_port: 8000
+base_path: /web
+tags:
+  - openenv
+  - security
+  - code-review
+  - reinforcement-learning
+---
+# Code Security Auditor Environment
+A real-world OpenEnv benchmark where agents perform security auditing on pull-request style code snapshots.
+The agent inspects files, submits vulnerability findings, and finalizes a report. The environment scores each episode with deterministic graders against hidden vulnerability ground truth, awarding partial credit and applying anti-reward-hacking penalties.
+## Why this is a real-world task
+Security reviewers and AppSec engineers routinely audit code for vulnerabilities before deployment. This environment models that workflow with concrete exploit classes:
+- SQL injection
+- command injection
+- insecure deserialization
+- weak authentication / auth bypass
+- SSRF
+- path traversal
+- hardcoded secrets
+## OpenEnv Compliance
+- Typed models: CodeSecurityAction, CodeSecurityObservation, CodeSecurityState
+- Core API: reset(), step(), state()
+- OpenEnv manifest: openenv.yaml
+- FastAPI runtime via server.app:app
+## Action Space
+Action model: CodeSecurityAction
+- action_type: inspect_file | submit_finding | submit_final_report
+- filename: target file to inspect or report against
+- line_start, line_end: suspected vulnerable range
+- vuln_type: one of the supported vulnerability classes
+- severity: low | medium | high | critical
+- confidence: [0.0, 1.0]
+- evidence, summary: free-form context
+### Action semantics
+- inspect_file: returns full line-numbered file content.
+- submit_finding: grades the finding with deterministic partial credit.
+- submit_final_report: ends the episode and returns final score in [0.0, 1.0].
+## Observation Space
+Observation model: CodeSecurityObservation
+Key fields:
+- task_id, task_title, difficulty, objective
+- available_files
+- focused_file, file_excerpt
+- findings_so_far
+- steps_remaining
+- last_feedback
+- score_hint in [0, 1]
+- reward, done, metadata
+## Tasks and Difficulty
+The environment includes 3 deterministic tasks:
+1. easy: Legacy Flask Patch Review
+2. medium: Payment Webhook Service
+3. hard: Enterprise Multi-Tenant API
+Each task has:
+- realistic multi-file code snapshot
+- hidden vulnerability ground truth
+- deterministic grader with score in [0.0, 1.0]
+## Reward Design
+Reward shaping is trajectory-aware and resistant to reward hacking:
+- inspect_file gives a small positive signal for novel, relevant file exploration
+- submit_finding gives a partial-credit ladder (file -> type -> line -> severity -> confidence calibration)
+- duplicate/low-quality findings reduce quality_multiplier and final score
+- false positives and over-submission reduce precision and final score
+- final score combines weighted recall, precision, structural quality, and calibration
+Spamming findings therefore uses up steps while lowering precision and quality, which removes the easy route to reward exploitation.
+## Baseline Scores
+With deterministic tasks and a simple tool-using model loop, expected baseline tendencies are:
+- easy: high recall, moderate precision
+- medium: moderate recall, moderate precision
+- hard: lower recall, stricter penalties for noisy findings
+Run inference.py to generate reproducible per-task scores for your selected model setup.
+## Setup
+### Option A: Run in-repo (OpenEnv monorepo)
+From repository root:
+```bash
+docker build -t code-security-auditor-env:latest -f envs/code_security_auditor_env/server/Dockerfile .
+docker run -p 8000:8000 code-security-auditor-env:latest
+```
+### Option B: Run standalone
+From this directory:
+```bash
+docker build -t code-security-auditor-env:latest .
+docker run -p 8000:8000 code-security-auditor-env:latest
+```
+## Baseline Inference
+The required script is inference.py in the project root (this directory).
+Required env vars:
+- API_BASE_URL
+- MODEL_NAME
+- HF_TOKEN
+Optional env vars:
+- LOCAL_IMAGE_NAME (for from_docker_image mode)
+- ENV_BASE_URL (for connecting to an already-running server)
+- TASK_IDS (comma-separated task ids, default: easy,medium,hard)
+- MAX_STEPS
+Run:
+```bash
+export HF_TOKEN=your_token
+export API_BASE_URL=https://router.huggingface.co/v1
+export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
+export LOCAL_IMAGE_NAME=code-security-auditor-env:latest
+python inference.py
+```
+The script prints only [START], [STEP], and [END] log lines per task.
+## Hugging Face Spaces Deployment
+Space repository:
+- https://huggingface.co/spaces/Drac0528/CodeSecure
+Recommended deploy flow (git push to Space repo):
+```bash
+git clone https://huggingface.co/spaces/Drac0528/CodeSecure
+cd CodeSecure
+cp -R /path/to/code_security_auditor_env/* .
+rm -f .env
+git add .
+git commit -m "Deploy Code Security Auditor OpenEnv"
+git push
+```
+Notes:
+- Keep README frontmatter and Dockerfile at Space repo root.
+- Use Space Settings to set runtime secrets/variables:
+  - HF_TOKEN (Secret)
+  - API_BASE_URL (Variable)
+  - MODEL_NAME (Variable)
+- Ensure Space tags include `openenv`.
+Verify API endpoint after build:
+```bash
+curl -X POST https://drac0528-codesecure.hf.space/reset -H 'Content-Type: application/json' -d '{}'
+```
+## Validation
+Use validate-submission.sh before submitting:
+```bash
+chmod +x validate-submission.sh
+./validate-submission.sh https://drac0528-codesecure.hf.space .
+```
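
The action space described in the README above can be sketched as plain dictionaries. This is a minimal illustration, not the environment's own validation: `validate_action` and the `episode` list are hypothetical, the line numbers and evidence text are invented, and the `line_start <= line_end` check is an extra sanity rule the README does not state. The field names, action types, and severities are the ones the README lists.

```python
from typing import Any, Dict, List

# Values taken from the README's Action Space section.
ACTION_TYPES = {"inspect_file", "submit_finding", "submit_final_report"}
SEVERITIES = {"low", "medium", "high", "critical"}


def validate_action(action: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the action looks well-formed."""
    action_type = action.get("action_type")
    if action_type not in ACTION_TYPES:
        return [f"unknown action_type: {action_type!r}"]

    problems: List[str] = []
    confidence = action.get("confidence", 0.5)
    if not 0.0 <= confidence <= 1.0:
        problems.append("confidence must be in [0.0, 1.0]")

    if action_type == "inspect_file" and not action.get("filename"):
        problems.append("inspect_file needs a filename")

    if action_type == "submit_finding":
        for field in ("filename", "line_start", "line_end", "vuln_type", "severity"):
            if action.get(field) is None:
                problems.append(f"submit_finding needs {field}")
        severity = action.get("severity")
        if severity is not None and severity not in SEVERITIES:
            problems.append("severity must be low, medium, high, or critical")
        start, end = action.get("line_start"), action.get("line_end")
        # Assumption: an inverted range is treated as a mistake here.
        if start is not None and end is not None and start > end:
            problems.append("line_start must not exceed line_end")
    return problems


# A hypothetical three-step episode: inspect, report one finding, finalize.
episode = [
    {"action_type": "inspect_file", "filename": "app/routes.py"},
    {
        "action_type": "submit_finding",
        "filename": "app/routes.py",
        "line_start": 10,  # invented location
        "line_end": 12,
        "vuln_type": "sql_injection",
        "severity": "high",
        "confidence": 0.8,
        "evidence": "query built with string formatting",
    },
    {"action_type": "submit_final_report", "summary": "one finding"},
]

for action in episode:
    print(action["action_type"], validate_action(action))
```

Checking actions locally before sending them is only a convenience; the server-side models remain the source of truth for what is accepted.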

VERIFICATION_REPORT.md ADDED

	@@ -0,0 +1,35 @@

+# Verification Report
+Date: Sun Apr  5 14:36:53 UTC 2026
+## 1) Unit tests
+PASS
+## 2) OpenEnv validate
+PASS
+## 3) API reset endpoint
+HTTP_STATUS=200
+## 4) Reset response preview
+{"observation":{"task_id":"easy","task_title":"Legacy Flask Patch Review","difficulty":"easy","objective":"Audit a small Flask service before release. Identify concrete security vulnerabilities with accurate location, type, and severity.","instructions":"Valid actions: inspect_file, submit_finding, submit_final_report. For submit_finding include filename, line_start/line_end, vuln_type, severity, confidence.","available_files":["app/config.py","app/db.py","app/routes.py"],"focused_file":null,"fi
+## 5) Raw command excerpts
+### pytest
+..                                                                       [100%]
+=============================== warnings summary ===============================
+../../../../../opt/anaconda3/lib/python3.12/site-packages/_pytest/config/__init__.py:1373
+  /opt/anaconda3/lib/python3.12/site-packages/_pytest/config/__init__.py:1373: PytestConfigWarning: Unknown config option: asyncio_default_fixture_loop_scope
+    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")
+../../../../../opt/anaconda3/lib/python3.12/site-packages/_pytest/config/__init__.py:1373
+  /opt/anaconda3/lib/python3.12/site-packages/_pytest/config/__init__.py:1373: PytestConfigWarning: Unknown config option: asyncio_mode
+    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")
+-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
+2 passed, 2 warnings in 1.67s
+### openenv validate
+[OK] code_security_auditor: Ready for multi-mode deployment

__init__.py ADDED

	@@ -0,0 +1,21 @@

+"""Code Security Auditor Environment package."""
+from .models import (
+    CodeSecurityAction,
+    CodeSecurityObservation,
+    CodeSecurityState,
+    FindingRecord,
+)
+try:
+    from .client import CodeSecurityAuditorEnv
+except Exception:  # pragma: no cover - optional during local/model-only imports
+    CodeSecurityAuditorEnv = None
+__all__ = [
+    "CodeSecurityAuditorEnv",
+    "CodeSecurityAction",
+    "CodeSecurityObservation",
+    "CodeSecurityState",
+    "FindingRecord",
+]

client.py ADDED

	@@ -0,0 +1,51 @@

+from __future__ import annotations
+from typing import Dict
+try:
+    from core.client_types import StepResult
+    from core.env_client import EnvClient
+except ImportError:
+    from openenv.core.client_types import StepResult
+    from openenv.core.env_client import EnvClient
+try:
+    from .models import CodeSecurityAction, CodeSecurityObservation, CodeSecurityState
+except ImportError:
+    from models import CodeSecurityAction, CodeSecurityObservation, CodeSecurityState
+class CodeSecurityAuditorEnv(
+    EnvClient[CodeSecurityAction, CodeSecurityObservation, CodeSecurityState]
+):
+    """Client wrapper for the Code Security Auditor environment server."""
+    def _step_payload(self, action: CodeSecurityAction) -> dict:
+        payload = {
+            "action_type": action.action_type,
+            "confidence": action.confidence,
+            "evidence": action.evidence,
+            "summary": action.summary,
+        }
+        if action.filename is not None:
+            payload["filename"] = action.filename
+        if action.line_start is not None:
+            payload["line_start"] = action.line_start
+        if action.line_end is not None:
+            payload["line_end"] = action.line_end
+        if action.vuln_type is not None:
+            payload["vuln_type"] = action.vuln_type
+        if action.severity is not None:
+            payload["severity"] = action.severity
+        return payload
+    def _parse_result(self, payload: Dict) -> StepResult[CodeSecurityObservation]:
+        observation = CodeSecurityObservation(**payload.get("observation", {}))
+        return StepResult(
+            observation=observation,
+            reward=payload.get("reward"),
+            done=bool(payload.get("done", False)),
+        )
+    def _parse_state(self, payload: Dict) -> CodeSecurityState:
+        return CodeSecurityState(**payload)

inference.py ADDED

	@@ -0,0 +1,220 @@

+#!/usr/bin/env python3
+from __future__ import annotations
+import asyncio
+import json
+import os
+from typing import Any, Dict, List, Optional
+from openai import OpenAI
+try:
+    from code_security_auditor_env import CodeSecurityAction, CodeSecurityAuditorEnv
+except ImportError:
+    from client import CodeSecurityAuditorEnv
+    from models import CodeSecurityAction
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
+ENV_BASE_URL = os.getenv("ENV_BASE_URL")
+TASK_IDS = [t.strip() for t in os.getenv("TASK_IDS", "easy,medium,hard").split(",") if t.strip()]
+MAX_STEPS = int(os.getenv("MAX_STEPS", "12"))
+TEMPERATURE = 0.0
+MAX_TOKENS = 260
+BENCHMARK = "code_security_auditor_env"
+SYSTEM_PROMPT = (
+    "You are a senior application security reviewer. Produce strictly valid JSON for the next action. "
+    "Allowed action_type values: inspect_file, submit_finding, submit_final_report. "
+    "Do not include markdown fences. Keep fields concise and accurate."
+)
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+    err = error if error else "null"
+    print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} done={str(done).lower()} error={err}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
+        flush=True,
+    )
+def _compact_action_str(action: Dict[str, Any]) -> str:
+    return json.dumps(action, separators=(",", ":"), ensure_ascii=True)
+def _default_action() -> Dict[str, Any]:
+    return {
+        "action_type": "submit_final_report",
+        "confidence": 0.5,
+        "summary": "fallback-finalize",
+        "evidence": "fallback-finalize",
+    }
+def _parse_action(raw: str, available_files: List[str]) -> Dict[str, Any]:
+    try:
+        parsed = json.loads(raw)
+        if not isinstance(parsed, dict):
+            return _default_action()
+    except Exception:
+        return _default_action()
+    action_type = parsed.get("action_type")
+    if action_type not in {"inspect_file", "submit_finding", "submit_final_report"}:
+        return _default_action()
+    action: Dict[str, Any] = {
+        "action_type": action_type,
+        "confidence": float(parsed.get("confidence", 0.5)),
+        "summary": str(parsed.get("summary", ""))[:400],
+        "evidence": str(parsed.get("evidence", ""))[:700],
+    }
+    if parsed.get("filename"):
+        filename = str(parsed["filename"])
+        if filename in available_files:
+            action["filename"] = filename
+    if parsed.get("line_start") is not None:
+        try:
+            action["line_start"] = max(1, int(parsed["line_start"]))
+        except Exception:
+            pass
+    if parsed.get("line_end") is not None:
+        try:
+            action["line_end"] = max(1, int(parsed["line_end"]))
+        except Exception:
+            pass
+    if parsed.get("vuln_type") is not None:
+        action["vuln_type"] = str(parsed["vuln_type"])
+    if parsed.get("severity") is not None:
+        action["severity"] = str(parsed["severity"])
+    action["confidence"] = min(1.0, max(0.0, action["confidence"]))
+    return action
+def _build_prompt(obs: Any, step: int) -> str:
+    findings = obs.findings_so_far[-4:] if obs.findings_so_far else []
+    snippet = obs.file_excerpt[:1800] if obs.file_excerpt else ""
+    return (
+        f"Task: {obs.task_id} ({obs.difficulty})\n"
+        f"Objective: {obs.objective}\n"
+        f"Step: {step}\n"
+        f"Steps remaining: {obs.steps_remaining}\n"
+        f"Files: {', '.join(obs.available_files)}\n"
+        f"Last feedback: {obs.last_feedback}\n"
+        f"Focused file: {obs.focused_file}\n"
+        f"Recent findings: {json.dumps(findings)}\n"
+        f"Visible snippet:\n{snippet}\n"
+        "Return one JSON object with action_type and required fields."
+    )
+def _query_model(client: OpenAI, obs: Any, step: int) -> Dict[str, Any]:
+    user_prompt = _build_prompt(obs, step)
+    try:
+        resp = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=[
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": user_prompt},
+            ],
+            temperature=TEMPERATURE,
+            max_tokens=MAX_TOKENS,
+            stream=False,
+        )
+        content = (resp.choices[0].message.content or "").strip()
+        return _parse_action(content, obs.available_files)
+    except Exception:
+        return _default_action()
+async def _create_env() -> CodeSecurityAuditorEnv:
+    if LOCAL_IMAGE_NAME:
+        return await CodeSecurityAuditorEnv.from_docker_image(LOCAL_IMAGE_NAME)
+    if ENV_BASE_URL:
+        return CodeSecurityAuditorEnv(base_url=ENV_BASE_URL)
+    raise RuntimeError(
+        "Set LOCAL_IMAGE_NAME (docker mode) or ENV_BASE_URL (remote mode) to run inference."
+    )
+async def run_task(env: CodeSecurityAuditorEnv, client: OpenAI, task_id: str) -> float:
+    log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
+    rewards: List[float] = []
+    steps_taken = 0
+    score = 0.0
+    success = False
+    try:
+        result = await env.reset(task_id=task_id)
+        obs = result.observation
+        for step in range(1, MAX_STEPS + 1):
+            if result.done:
+                break
+            action_dict = _query_model(client, obs, step)
+            action_str = _compact_action_str(action_dict)
+            action = CodeSecurityAction(**action_dict)
+            result = await env.step(action)
+            obs = result.observation
+            reward = float(result.reward or 0.0)
+            done = bool(result.done)
+            error = obs.metadata.get("last_action_error")
+            rewards.append(reward)
+            steps_taken = step
+            log_step(step=step, action=action_str, reward=reward, done=done, error=error)
+            if done:
+                break
+        score = float(obs.reward or 0.0)
+        score = min(max(score, 0.0), 1.0)
+        success = score >= 0.6
+    finally:
+        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+    return score
+async def main() -> None:
+    if not API_KEY:
+        raise RuntimeError("HF_TOKEN (or API_KEY) is required for inference.")
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    env = await _create_env()
+    try:
+        scores: List[float] = []
+        for task_id in TASK_IDS:
+            score = await run_task(env, client, task_id)
+            scores.append(score)
+        # Keep strict output format requirement: no extra structured tags beyond START/STEP/END.
+        _ = scores
+    finally:
+        await env.close()
+if __name__ == "__main__":
+    asyncio.run(main())
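
The `[STEP]` and `[END]` lines printed by `log_step` and `log_end` above are regular enough to read back with a regular expression. The following is a rough sketch of such a reader, not part of the commit: `parse_step`, `parse_end`, and the sample lines are hypothetical, and the reward values in the samples are invented. It assumes the action is the compact JSON produced by `_compact_action_str` and that the error text does not itself contain a ` reward=` field.

```python
import json
import re
from typing import Any, Dict, Optional

# Mirrors the f-string layout used by log_step in inference.py.
STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>.*) "
    r"reward=(?P<reward>-?\d+\.\d{2}) done=(?P<done>true|false) error=(?P<error>.*)$"
)
# Mirrors the f-string layout used by log_end in inference.py.
END_RE = re.compile(
    r"^\[END\] success=(?P<success>true|false) steps=(?P<steps>\d+) "
    r"score=(?P<score>-?\d+\.\d{3}) rewards=(?P<rewards>.*)$"
)


def parse_step(line: str) -> Optional[Dict[str, Any]]:
    """Parse one [STEP] line; return None if the line has a different shape."""
    m = STEP_RE.match(line.strip())
    if m is None:
        return None
    return {
        "step": int(m["step"]),
        "action": json.loads(m["action"]),
        "reward": float(m["reward"]),
        "done": m["done"] == "true",
        "error": None if m["error"] == "null" else m["error"],
    }


def parse_end(line: str) -> Optional[Dict[str, Any]]:
    """Parse one [END] line; return None if the line has a different shape."""
    m = END_RE.match(line.strip())
    if m is None:
        return None
    rewards = [float(r) for r in m["rewards"].split(",") if r]
    return {
        "success": m["success"] == "true",
        "steps": int(m["steps"]),
        "score": float(m["score"]),
        "rewards": rewards,
    }


# Invented sample lines in the same format.
sample_step = (
    '[STEP] step=1 action={"action_type":"inspect_file","confidence":0.5,'
    '"summary":"","evidence":"","filename":"app/routes.py"} '
    "reward=0.05 done=false error=null"
)
sample_end = "[END] success=true steps=2 score=0.750 rewards=0.05,0.75"

print(parse_step(sample_step))
print(parse_end(sample_end))
```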

models.py ADDED

	@@ -0,0 +1,90 @@

+from __future__ import annotations
+from typing import Any, Dict, List, Literal, Optional
+from pydantic import Field
+try:
+    from core.env_server.types import Action, Observation, State
+except ImportError:
+    try:
+        from openenv.core.env_server.types import Action, Observation, State
+    except ImportError:
+        from openenv_core.env_server.types import Action, Observation, State
+ActionType = Literal["inspect_file", "submit_finding", "submit_final_report"]
+VulnerabilityType = Literal[
+    "sql_injection",
+    "command_injection",
+    "path_traversal",
+    "weak_authentication",
+    "insecure_deserialization",
+    "ssrf",
+    "hardcoded_secret",
+    "xss",
+]
+Severity = Literal["low", "medium", "high", "critical"]
+class CodeSecurityAction(Action):
+    """Action sent by the agent during a security audit episode."""
+    action_type: ActionType
+    filename: Optional[str] = None
+    line_start: Optional[int] = Field(default=None, ge=1)
+    line_end: Optional[int] = Field(default=None, ge=1)
+    vuln_type: Optional[VulnerabilityType] = None
+    severity: Optional[Severity] = None
+    confidence: float = Field(default=0.5, ge=0.0, le=1.0)
+    evidence: str = ""
+    summary: str = ""
+class FindingRecord(State):
+    """Stored record of one submitted finding."""
+    finding_id: str
+    filename: str
+    line_start: int
+    line_end: int
+    vuln_type: str
+    severity: str
+    confidence: float
+    evidence: str
+    summary: str
+    matched_vulnerability_id: Optional[str] = None
+    component_score: float = 0.0
+class CodeSecurityObservation(Observation):
+    """Observation returned after reset() and step()."""
+    task_id: str
+    task_title: str
+    difficulty: str
+    objective: str
+    instructions: str
+    available_files: List[str] = Field(default_factory=list)
+    focused_file: Optional[str] = None
+    file_excerpt: str = ""
+    findings_so_far: List[Dict[str, Any]] = Field(default_factory=list)
+    steps_remaining: int = 0
+    last_feedback: str = ""
+    score_hint: float = Field(default=0.0, ge=0.0, le=1.0)
+class CodeSecurityState(State):
+    """Internal environment state for the current security auditing episode."""
+    task_id: str = ""
+    task_title: str = ""
+    difficulty: str = ""
+    objective: str = ""
+    max_steps: int = 0
+    inspected_files: List[str] = Field(default_factory=list)
+    findings_submitted: List[FindingRecord] = Field(default_factory=list)
+    matched_vulnerability_ids: List[str] = Field(default_factory=list)
+    false_positive_count: int = 0
+    duplicate_submission_count: int = 0
+    quality_multiplier: float = 1.0
+    final_score: Optional[float] = None

openenv.yaml ADDED

	@@ -0,0 +1,6 @@

+spec_version: 1
+name: code_security_auditor_env
+type: space
+runtime: fastapi
+app: server.app:app
+port: 8000

pyproject.toml ADDED

	@@ -0,0 +1,34 @@

+[build-system]
+requires = ["setuptools>=45", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "openenv-code-security-auditor-env"
+version = "0.1.0"
+description = "Code Security Auditor Environment for OpenEnv"
+requires-python = ">=3.10"
+dependencies = [
+    "openenv-core[core]>=0.2.2",
+    "fastapi>=0.115.0",
+    "pydantic>=2.0.0",
+    "uvicorn[standard]>=0.24.0",
+    "requests>=2.31.0",
+    "openai>=1.40.0",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.0.0",
+    "pytest-cov>=4.0.0",
+]
+[project.scripts]
+server = "code_security_auditor_env.server.app:main"
+[tool.setuptools]
+include-package-data = true
+packages = ["code_security_auditor_env", "code_security_auditor_env.server"]
+package-dir = { "code_security_auditor_env" = ".", "code_security_auditor_env.server" = "server" }
+[tool.setuptools.package-data]
+code_security_auditor_env = ["**/*.yaml", "**/*.yml"]

server/Dockerfile ADDED

	@@ -0,0 +1,49 @@

+ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+FROM ${BASE_IMAGE} AS builder
+WORKDIR /app
+COPY envs/code_security_auditor_env /app/env
+WORKDIR /app/env
+RUN if ! command -v uv >/dev/null 2>&1; then \
+        curl -LsSf https://astral.sh/uv/install.sh | sh && \
+        mv /root/.local/bin/uv /usr/local/bin/uv && \
+        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+    fi
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git \
+    curl \
+    ca-certificates \
+    && rm -rf /var/lib/apt/lists/*
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-install-project --no-editable; \
+    else \
+        uv sync --no-install-project --no-editable; \
+    fi
+RUN --mount=type=cache,target=/root/.cache/uv \
+    if [ -f uv.lock ]; then \
+        uv sync --frozen --no-editable; \
+    else \
+        uv sync --no-editable; \
+    fi
+FROM ${BASE_IMAGE}
+WORKDIR /app
+COPY --from=builder /app/env/.venv /app/.venv
+COPY --from=builder /app/env /app/env
+ENV PATH="/app/.venv/bin:$PATH"
+ENV PYTHONPATH="/app/env:$PYTHONPATH"
+ENV PYTHONUNBUFFERED=1
+ENV ENABLE_WEB_INTERFACE=true
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
+CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]

server/__init__.py ADDED

	@@ -0,0 +1 @@


+"""Server package for Code Security Auditor environment."""

server/__pycache__/__init__.cpython-312.pyc ADDED

Binary file (263 Bytes).

server/__pycache__/__init__.cpython-314.pyc ADDED

Binary file (187 Bytes).

server/__pycache__/app.cpython-314.pyc ADDED

Binary file (1.34 kB).

server/__pycache__/grader.cpython-312.pyc ADDED

Binary file (6.56 kB).

server/__pycache__/grader.cpython-314.pyc ADDED

Binary file (7.38 kB).

server/__pycache__/security_environment.cpython-312.pyc ADDED

Binary file (17.9 kB).

server/__pycache__/security_environment.cpython-314.pyc ADDED

Binary file (20.1 kB).

server/__pycache__/tasks.cpython-312.pyc ADDED

Binary file (8.78 kB).

server/__pycache__/tasks.cpython-314.pyc ADDED

Binary file (9.1 kB).

server/app.py ADDED

	@@ -0,0 +1,33 @@

+from __future__ import annotations
+try:
+    from core.env_server.http_server import create_app
+except ImportError:
+    try:
+        from openenv.core.env_server.http_server import create_app
+    except ImportError:
+        from openenv_core.env_server.http_server import create_app
+try:
+    from ..models import CodeSecurityAction, CodeSecurityObservation
+    from .security_environment import CodeSecurityAuditorEnvironment
+except ImportError:
+    from models import CodeSecurityAction, CodeSecurityObservation
+    from server.security_environment import CodeSecurityAuditorEnvironment
+app = create_app(
+    CodeSecurityAuditorEnvironment,
+    CodeSecurityAction,
+    CodeSecurityObservation,
+    env_name="code_security_auditor_env",
+)
+def main() -> None:
+    import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=8000)
+if __name__ == "__main__":
+    main()

server/grader.py ADDED

	@@ -0,0 +1,181 @@

+from __future__ import annotations
+from dataclasses import dataclass
+from typing import Iterable, Optional
+from .tasks import SEVERITY_WEIGHTS, TARGET_CONFIDENCE, TaskSpec, VulnerabilitySpec
+@dataclass(frozen=True)
+class FindingEvaluation:
+    component_score: float
+    matched_vulnerability_id: Optional[str]
+    is_confirmed_match: bool
+    feedback: str
+    confidence_calibration: float
+def _line_overlap_score(submitted_start: int, submitted_end: int, target_line: int) -> float:
+    if submitted_start <= target_line <= submitted_end:
+        return 1.0
+    min_distance = min(abs(target_line - submitted_start), abs(target_line - submitted_end))
+    if min_distance <= 2:
+        return 0.6
+    if min_distance <= 5:
+        return 0.3
+    return 0.0
+def _best_candidate(
+    task: TaskSpec,
+    filename: str,
+    vuln_type: str,
+    severity: str,
+    line_start: int,
+    line_end: int,
+) -> tuple[Optional[VulnerabilitySpec], float, float, float, float]:
+    best_target = None
+    best_score = -1.0
+    best_type_match = 0.0
+    best_line_match = 0.0
+    best_severity_match = 0.0
+    for target in task.vulnerabilities:
+        file_match = 1.0 if target.filename == filename else 0.0
+        type_match = 1.0 if target.vuln_type == vuln_type else 0.0
+        severity_match = 1.0 if target.severity == severity else 0.0
+        line_match = _line_overlap_score(line_start, line_end, target.line)
+        candidate_score = (
+            0.35 * file_match
+            + 0.30 * type_match
+            + 0.20 * line_match
+            + 0.15 * severity_match
+        )
+        if candidate_score > best_score:
+            best_score = candidate_score
+            best_target = target
+            best_type_match = type_match
+            best_line_match = line_match
+            best_severity_match = severity_match
+    return best_target, max(best_score, 0.0), best_type_match, best_line_match, best_severity_match
+def evaluate_finding(
+    *,
+    task: TaskSpec,
+    filename: str,
+    vuln_type: str,
+    severity: str,
+    line_start: int,
+    line_end: int,
+    confidence: float,
+    matched_already: Iterable[str],
+) -> FindingEvaluation:
+    target, structure_score, type_match, line_match, severity_match = _best_candidate(
+        task,
+        filename,
+        vuln_type,
+        severity,
+        line_start,
+        line_end,
+    )
+    if target is None:
+        return FindingEvaluation(
+            component_score=0.0,
+            matched_vulnerability_id=None,
+            is_confirmed_match=False,
+            feedback="No plausible vulnerability match for this finding.",
+            confidence_calibration=0.0,
+        )
+    target_conf = TARGET_CONFIDENCE[target.severity]
+    calibration = max(0.0, 1.0 - abs(confidence - target_conf))
+    component_score = 0.8 * structure_score + 0.2 * calibration
+    component_score = max(0.0, min(1.0, component_score))
+    confirmed = (
+        target.filename == filename
+        and type_match == 1.0
+        and line_match >= 0.6
+        and severity_match == 1.0
+    )
+    if target.id in set(matched_already) and confirmed:
+        return FindingEvaluation(
+            component_score=0.25 * component_score,
+            matched_vulnerability_id=target.id,
+            is_confirmed_match=False,
+            feedback="Duplicate of a previously confirmed vulnerability.",
+            confidence_calibration=calibration,
+        )
+    if confirmed:
+        return FindingEvaluation(
+            component_score=component_score,
+            matched_vulnerability_id=target.id,
+            is_confirmed_match=True,
+            feedback="Confirmed vulnerability: file/type/line/severity align with ground truth.",
+            confidence_calibration=calibration,
+        )
+    if target.filename != filename:
+        hint = "Wrong file."
+    elif type_match == 0.0:
+        hint = "Correct file, vulnerability type mismatch."
+    elif line_match < 0.6:
+        hint = "Correct file/type, but location is off."
+    elif severity_match == 0.0:
+        hint = "Severity mismatch."
+    else:
+        hint = "Partial match, refine details."
+    return FindingEvaluation(
+        component_score=component_score,
+        matched_vulnerability_id=None,
+        is_confirmed_match=False,
+        feedback=hint,
+        confidence_calibration=calibration,
+    )
+def final_grade(
+    *,
+    task: TaskSpec,
+    confirmed_vulnerability_ids: Iterable[str],
+    findings_count: int,
+    false_positive_count: int,
+    duplicate_count: int,
+    avg_component_score: float,
+    avg_confidence_calibration: float,
+) -> float:
+    confirmed_ids = set(confirmed_vulnerability_ids)
+    total_weight = sum(SEVERITY_WEIGHTS[v.severity] for v in task.vulnerabilities)
+    covered_weight = sum(
+        SEVERITY_WEIGHTS[v.severity] for v in task.vulnerabilities if v.id in confirmed_ids
+    )
+    weighted_recall = (covered_weight / total_weight) if total_weight > 0 else 0.0
+    precision = (len(confirmed_ids) / findings_count) if findings_count > 0 else 0.0
+    fp_penalty = min(0.5, 0.08 * false_positive_count)
+    dup_penalty = min(0.2, 0.05 * duplicate_count)
+    volume_penalty = 0.0
+    optimal_findings = len(task.vulnerabilities) + 1
+    if findings_count > optimal_findings:
+        volume_penalty = min(0.2, 0.03 * (findings_count - optimal_findings))
+    score = (
+        0.55 * weighted_recall
+        + 0.20 * precision
+        + 0.15 * avg_component_score
+        + 0.10 * avg_confidence_calibration
+    )
+    score -= fp_penalty + dup_penalty + volume_penalty
+    return max(0.0, min(1.0, score))
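
The weighting in `final_grade` above is easier to follow with numbers plugged in. The following is a minimal, standalone sketch that restates the same arithmetic outside the package; `sketch_final_grade` and its parameters are hypothetical names used only for illustration, and the two scenarios are invented inputs, not recorded results.

```python
# Standalone restatement of the final_grade arithmetic shown in the diff.
# The function name and argument layout are assumptions made for this example.
SEVERITY_WEIGHTS = {"low": 1.0, "medium": 2.0, "high": 3.0, "critical": 4.0}


def sketch_final_grade(
    severities: list[str],      # severity of each ground-truth vulnerability
    confirmed: set[int],        # indexes into `severities` that were confirmed
    findings_count: int,
    false_positives: int,
    duplicates: int,
    avg_component: float,
    avg_calibration: float,
) -> float:
    total = sum(SEVERITY_WEIGHTS[s] for s in severities)
    covered = sum(SEVERITY_WEIGHTS[severities[i]] for i in confirmed)
    weighted_recall = covered / total if total > 0 else 0.0
    precision = len(confirmed) / findings_count if findings_count > 0 else 0.0

    # Penalties are capped so a single failure mode cannot dominate the score.
    fp_penalty = min(0.5, 0.08 * false_positives)
    dup_penalty = min(0.2, 0.05 * duplicates)
    optimal = len(severities) + 1
    volume_penalty = (
        min(0.2, 0.03 * (findings_count - optimal)) if findings_count > optimal else 0.0
    )

    score = (
        0.55 * weighted_recall
        + 0.20 * precision
        + 0.15 * avg_component
        + 0.10 * avg_calibration
    )
    score -= fp_penalty + dup_penalty + volume_penalty
    return max(0.0, min(1.0, score))


easy = ["high", "high", "medium"]
# All three vulnerabilities confirmed with three clean findings.
perfect = sketch_final_grade(easy, {0, 1, 2}, 3, 0, 0, 1.0, 1.0)
# One confirmed finding, two duplicates and one false positive.
spam = sketch_final_grade(easy, {0}, 4, 1, 2, 0.5, 0.8)
```

With these inputs the spam run keeps only 3/8 of the severity-weighted recall and loses a further 0.18 to penalties, so it lands well below the clean run. The environment additionally multiplies this value by its quality multiplier before reporting it.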

server/security_environment.py ADDED

	@@ -0,0 +1,386 @@

+from __future__ import annotations
+import random
+import uuid
+from typing import Any, Optional
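+# Try each import path for the Environment base class in turn:
+# core.env_server, then openenv.core.env_server, then openenv_core.env_server.
+try: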
+    from core.env_server.interfaces import Environment
+except ImportError:
+    try:
+        from openenv.core.env_server.interfaces import Environment
+    except ImportError:
+        from openenv_core.env_server.interfaces import Environment
+try:
+    from ..models import (
+        CodeSecurityAction,
+        CodeSecurityObservation,
+        CodeSecurityState,
+        FindingRecord,
+    )
+    from .grader import evaluate_finding, final_grade
+    from .tasks import TaskSpec, get_task, list_task_ids
+except ImportError:
+    from models import (
+        CodeSecurityAction,
+        CodeSecurityObservation,
+        CodeSecurityState,
+        FindingRecord,
+    )
+    from server.grader import evaluate_finding, final_grade
+    from server.tasks import TaskSpec, get_task, list_task_ids
+class CodeSecurityAuditorEnvironment(
+    Environment[CodeSecurityAction, CodeSecurityObservation, CodeSecurityState]
+):
+    """Real-world code security auditing simulator with deterministic graders."""
+    SUPPORTS_CONCURRENT_SESSIONS = True
+    def __init__(self, default_task_id: str = "easy"):
+        self._default_task_id = default_task_id
+        self._task_cursor = 0
+        self._task: Optional[TaskSpec] = None
+        self._state = CodeSecurityState()
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        **kwargs: Any,
+    ) -> CodeSecurityObservation:
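+        # Task selection precedence: explicit task_id/task kwarg, then a seeded
+        # random choice, then the configured default, then round-robin over all tasks.
+        requested_task = kwargs.get("task_id") or kwargs.get("task")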
+        if requested_task is not None:
+            task = get_task(str(requested_task))
+        elif seed is not None:
+            rng = random.Random(seed)
+            task = get_task(rng.choice(list_task_ids()))
+        elif self._default_task_id:
+            task = get_task(self._default_task_id)
+        else:
+            task_order = list_task_ids()
+            task = get_task(task_order[self._task_cursor % len(task_order)])
+            self._task_cursor += 1
+        self._task = task
+        self._state = CodeSecurityState(
+            episode_id=episode_id or str(uuid.uuid4()),
+            step_count=0,
+            task_id=task.id,
+            task_title=task.title,
+            difficulty=task.difficulty,
+            objective=task.objective,
+            max_steps=task.max_steps,
+            inspected_files=[],
+            findings_submitted=[],
+            matched_vulnerability_ids=[],
+            false_positive_count=0,
+            duplicate_submission_count=0,
+            quality_multiplier=1.0,
+            final_score=None,
+        )
+        return self._build_observation(
+            reward=0.0,
+            done=False,
+            feedback=(
+                "Audit started. Use inspect_file before submit_finding. "
+                "Finish with submit_final_report."
+            ),
+            focused_file=None,
+            excerpt="",
+            extra_metadata={
+                "available_task_ids": list_task_ids(),
+                "task_id": task.id,
+            },
+        )
+    def step(
+        self,
+        action: CodeSecurityAction,
+        timeout_s: Optional[float] = None,
+        **kwargs: Any,
+    ) -> CodeSecurityObservation:
+        del timeout_s, kwargs
+        task = self._require_task()
+        if self._state.final_score is not None:
+            return self._build_observation(
+                reward=0.0,
+                done=True,
+                feedback="Episode already terminated. Call reset() to start a new task.",
+                focused_file=None,
+                excerpt="",
+            )
+        self._state.step_count += 1
+        feedback = ""
+        reward = 0.0
+        focused_file = None
+        excerpt = ""
+        if action.action_type == "inspect_file":
+            reward, feedback, focused_file, excerpt = self._handle_inspect_file(action, task)
+        elif action.action_type == "submit_finding":
+            reward, feedback = self._handle_submit_finding(action, task)
+        elif action.action_type == "submit_final_report":
+            reward, feedback = self._handle_submit_final_report()
+        else:
+            feedback = f"Unsupported action_type={action.action_type}."
+            self._degrade_quality(0.03)
+        done = self._state.final_score is not None
+        # Auto-finalize the episode once the step budget is exhausted.
+        if not done and self._state.step_count >= self._state.max_steps:
+            score = self._compute_final_score(task)
+            self._state.final_score = score
+            done = True
+            reward = score
+            feedback = (
+                f"Max steps reached. Auto-finalized audit score={score:.3f}. "
+                "Use fewer but higher-quality findings to improve precision."
+            )
+        return self._build_observation(
+            reward=reward,
+            done=done,
+            feedback=feedback,
+            focused_file=focused_file,
+            excerpt=excerpt,
+            extra_metadata={
+                "last_action_error": None,
+            },
+        )
+    @property
+    def state(self) -> CodeSecurityState:
+        return self._state
+    def _require_task(self) -> TaskSpec:
+        if self._task is None:
+            raise RuntimeError("Environment has no active task. Call reset() first.")
+        return self._task
+    def _degrade_quality(self, amount: float) -> None:
+        # Lower the quality multiplier by `amount`, clamped at a floor of 0.2.
+        self._state.quality_multiplier = max(0.2, self._state.quality_multiplier - amount)
+    def _format_file(self, content: str) -> str:
+        # Prefix each line with a right-aligned, 1-based line number.
+        numbered = [
+            f"{idx:>3}: {line}" for idx, line in enumerate(content.splitlines(), start=1)
+        ]
+        return "\n".join(numbered)
+    def _handle_inspect_file(
+        self,
+        action: CodeSecurityAction,
+        task: TaskSpec,
+    ) -> tuple[float, str, Optional[str], str]:
+        filename = action.filename or ""
+        if filename not in task.repository:
+            self._degrade_quality(0.04)
+            return 0.0, f"Unknown file '{filename}'.", filename or None, ""
+        first_time = filename not in self._state.inspected_files
+        if first_time:
+            self._state.inspected_files.append(filename)
+        excerpt = self._format_file(task.repository[filename])
+        unmatched_in_file = any(
+            vuln.filename == filename and vuln.id not in self._state.matched_vulnerability_ids
+            for vuln in task.vulnerabilities
+        )
+        if first_time and unmatched_in_file:
+            reward = 0.04
+            feedback = "Useful inspection: this file likely contains unresolved security issues."
+        elif first_time:
+            reward = 0.02
+            feedback = "Inspection noted. No strong security signal yet."
+        else:
+            reward = 0.0
+            feedback = "File already inspected; repeated reads do not improve score."
+            self._degrade_quality(0.01)
+        return reward, feedback, filename, excerpt
+    def _handle_submit_finding(
+        self,
+        action: CodeSecurityAction,
+        task: TaskSpec,
+    ) -> tuple[float, str]:
+        required_missing = []
+        if not action.filename:
+            required_missing.append("filename")
+        if action.line_start is None:
+            required_missing.append("line_start")
+        if not action.vuln_type:
+            required_missing.append("vuln_type")
+        if not action.severity:
+            required_missing.append("severity")
+        if required_missing:
+            self._degrade_quality(0.05)
+            missing = ", ".join(required_missing)
+            return 0.0, f"Incomplete finding. Missing fields: {missing}."
+        line_end = action.line_end if action.line_end is not None else action.line_start
+        evaluation = evaluate_finding(
+            task=task,
+            filename=action.filename,
+            vuln_type=action.vuln_type,
+            severity=action.severity,
+            line_start=action.line_start,
+            line_end=line_end,
+            confidence=action.confidence,
+            matched_already=self._state.matched_vulnerability_ids,
+        )
+        finding_id = f"finding-{len(self._state.findings_submitted) + 1}"
+        finding_record = FindingRecord(
+            finding_id=finding_id,
+            filename=action.filename,
+            line_start=action.line_start,
+            line_end=line_end,
+            vuln_type=action.vuln_type,
+            severity=action.severity,
+            confidence=action.confidence,
+            evidence=(action.evidence or "").strip(),
+            summary=(action.summary or "").strip(),
+            matched_vulnerability_id=evaluation.matched_vulnerability_id,
+            component_score=evaluation.component_score,
+        )
+        self._state.findings_submitted.append(finding_record)
+        if evaluation.is_confirmed_match and evaluation.matched_vulnerability_id is not None:
+            self._state.matched_vulnerability_ids.append(evaluation.matched_vulnerability_id)
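+            # Confirmed findings earn a 0.25 base plus up to 0.75 scaled by match
+            # quality, all damped by the running quality multiplier.
+            reward = min(
+                1.0,
+                (0.25 + 0.75 * evaluation.component_score) * self._state.quality_multiplier,
+            )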
+            feedback = (
+                f"{evaluation.feedback} "
+                f"Confirmed={len(self._state.matched_vulnerability_ids)}/{len(task.vulnerabilities)}."
+            )
+            return reward, feedback
+        if (
+            evaluation.matched_vulnerability_id is not None
+            and evaluation.matched_vulnerability_id in self._state.matched_vulnerability_ids
+        ):
+            self._state.duplicate_submission_count += 1
+            self._degrade_quality(0.04)
+            return 0.01, evaluation.feedback
+        if evaluation.component_score >= 0.45:
+            self._degrade_quality(0.01)
+            reward = min(0.2, 0.2 * evaluation.component_score * self._state.quality_multiplier)
+            return reward, f"Partial progress: {evaluation.feedback}"
+        self._state.false_positive_count += 1
+        self._degrade_quality(0.05)
+        return 0.0, f"Likely false positive: {evaluation.feedback}"
+    def _handle_submit_final_report(self) -> tuple[float, str]:
+        task = self._require_task()
+        score = self._compute_final_score(task)
+        self._state.final_score = score
+        feedback = (
+            f"Audit finalized. Final deterministic score={score:.3f}. "
+            f"Confirmed {len(self._state.matched_vulnerability_ids)} of {len(task.vulnerabilities)} vulnerabilities."
+        )
+        return score, feedback
+    def _compute_final_score(self, task: TaskSpec) -> float:
+        findings = self._state.findings_submitted
+        if findings:
+            avg_component = sum(f.component_score for f in findings) / len(findings)
+            # Calibration here is measured against a fixed 0.75 target confidence.
+            avg_calibration = sum(
+                max(0.0, 1.0 - abs(f.confidence - 0.75)) for f in findings
+            ) / len(findings)
+        else:
+            avg_component = 0.0
+            avg_calibration = 0.0
+        score = final_grade(
+            task=task,
+            confirmed_vulnerability_ids=self._state.matched_vulnerability_ids,
+            findings_count=len(self._state.findings_submitted),
+            false_positive_count=self._state.false_positive_count,
+            duplicate_count=self._state.duplicate_submission_count,
+            avg_component_score=avg_component,
+            avg_confidence_calibration=avg_calibration,
+        )
+        # Scaling by the quality multiplier penalizes spam and random guessing,
+        # limiting reward hacking while preserving partial-credit gradients.
+        score *= self._state.quality_multiplier
+        return max(0.0, min(1.0, score))
+    def _build_observation(
+        self,
+        *,
+        reward: float,
+        done: bool,
+        feedback: str,
+        focused_file: Optional[str],
+        excerpt: str,
+        extra_metadata: Optional[dict[str, Any]] = None,
+    ) -> CodeSecurityObservation:
+        task = self._require_task()
+        findings_public = [
+            {
+                "finding_id": f.finding_id,
+                "filename": f.filename,
+                "line_start": f.line_start,
+                "line_end": f.line_end,
+                "vuln_type": f.vuln_type,
+                "severity": f.severity,
+                "confidence": f.confidence,
+                "component_score": round(f.component_score, 3),
+            }
+            for f in self._state.findings_submitted
+        ]
+        score_hint = len(self._state.matched_vulnerability_ids) / max(1, len(task.vulnerabilities))
+        metadata = {
+            "quality_multiplier": round(self._state.quality_multiplier, 4),
+            "false_positive_count": self._state.false_positive_count,
+            "duplicate_submission_count": self._state.duplicate_submission_count,
+            "confirmed_vulnerabilities": len(self._state.matched_vulnerability_ids),
+            "total_vulnerabilities": len(task.vulnerabilities),
+            "task_id": task.id,
+            "difficulty": task.difficulty,
+            "available_task_ids": list_task_ids(),
+            "last_action_error": None,
+        }
+        if extra_metadata:
+            metadata.update(extra_metadata)
+        return CodeSecurityObservation(
+            done=done,
+            reward=max(0.0, min(1.0, reward)),
+            metadata=metadata,
+            task_id=task.id,
+            task_title=task.title,
+            difficulty=task.difficulty,
+            objective=task.objective,
+            instructions=(
+                "Valid actions: inspect_file, submit_finding, submit_final_report. "
+                "For submit_finding include filename, line_start/line_end, vuln_type, severity, confidence."
+            ),
+            available_files=sorted(task.repository.keys()),
+            focused_file=focused_file,
+            file_excerpt=excerpt,
+            findings_so_far=findings_public,
+            steps_remaining=max(0, self._state.max_steps - self._state.step_count),
+            last_feedback=feedback,
+            score_hint=max(0.0, min(1.0, score_hint)),
+        )
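
The per-step reward shaping in `security_environment.py` depends on how the quality multiplier decays. The sketch below restates that arithmetic as standalone functions, assuming the constants shown in the diff (0.2 floor, 0.25 base reward, 0.2 cap on partial credit); the function names are hypothetical and do not exist in the package.

```python
# Standalone restatement of the quality-multiplier and reward arithmetic.
# Function names are illustrative assumptions, not part of the environment API.

def degrade(quality: float, amount: float) -> float:
    """Lower the multiplier by `amount`, never dropping below 0.2."""
    return max(0.2, quality - amount)


def confirmed_reward(component_score: float, quality: float) -> float:
    """Reward for a confirmed finding: 0.25 base plus scaled match quality."""
    return min(1.0, (0.25 + 0.75 * component_score) * quality)


def partial_reward(component_score: float, quality: float) -> float:
    """Reward for a near miss (component_score >= 0.45), capped at 0.2."""
    return min(0.2, 0.2 * component_score * quality)


# Three false positives cost 0.05 each before a correct finding is submitted.
quality = 1.0
for _ in range(3):
    quality = degrade(quality, 0.05)
late_reward = confirmed_reward(1.0, quality)

# Sustained spam bottoms out at the floor instead of reaching zero.
floor_quality = 1.0
for _ in range(30):
    floor_quality = degrade(floor_quality, 0.05)
```

Because the same multiplier also scales the final score, early noise lowers every later reward, while the 0.2 floor keeps some gradient available to an agent that recovers.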

server/tasks.py ADDED

	@@ -0,0 +1,208 @@

+from __future__ import annotations
+from dataclasses import dataclass
+from typing import Dict, List, Literal
+Difficulty = Literal["easy", "medium", "hard"]
+Severity = Literal["low", "medium", "high", "critical"]
+@dataclass(frozen=True)
+class VulnerabilitySpec:
+    id: str
+    filename: str
+    line: int
+    vuln_type: str
+    severity: Severity
+    title: str
+    rationale: str
+@dataclass(frozen=True)
+class TaskSpec:
+    id: str
+    title: str
+    difficulty: Difficulty
+    objective: str
+    max_steps: int
+    repository: Dict[str, str]
+    vulnerabilities: List[VulnerabilitySpec]
+TASKS: Dict[str, TaskSpec] = {
+    "easy": TaskSpec(
+        id="easy",
+        title="Legacy Flask Patch Review",
+        difficulty="easy",
+        objective=(
+            "Audit a small Flask service before release. Identify concrete "
+            "security vulnerabilities with accurate location, type, and severity."
+        ),
+        max_steps=12,
+        repository={
+            "app/routes.py": """from flask import Blueprint, request, jsonify\nfrom app.db import db\n\nbp = Blueprint('api', __name__)\n\n@bp.get('/user')\ndef get_user():\n    user_id = request.args.get('id', '')\n    query = f\"SELECT id, email, role FROM users WHERE id = '{user_id}'\"\n    row = db.execute(query).fetchone()\n    return jsonify(dict(row) if row else {})\n\n@bp.post('/login')\ndef login():\n    payload = request.json or {}\n    if payload.get('token') == 'letmein':\n        return jsonify({'ok': True})\n    return jsonify({'ok': False}), 401\n""",
+            "app/config.py": """import os\n\nclass Config:\n    DEBUG = os.getenv('DEBUG', '0') == '1'\n    SECRET_KEY = 'prod-secret-2026'\n    DB_URL = os.getenv('DB_URL', 'postgresql://localhost/app')\n""",
+            "app/db.py": """import sqlite3\n\n_conn = sqlite3.connect(':memory:', check_same_thread=False)\n_conn.execute('CREATE TABLE IF NOT EXISTS users (id TEXT, email TEXT, role TEXT)')\n\ndef execute(query: str):\n    return _conn.execute(query)\n\nclass DB:\n    execute = staticmethod(execute)\n\ndb = DB()\n""",
+        },
+        vulnerabilities=[
+            VulnerabilitySpec(
+                id="E-01",
+                filename="app/routes.py",
+                line=8,
+                vuln_type="sql_injection",
+                severity="high",
+                title="Unsanitized SQL query with user-controlled id",
+                rationale="Direct string interpolation builds SQL using request input.",
+            ),
+            VulnerabilitySpec(
+                id="E-02",
+                filename="app/config.py",
+                line=5,
+                vuln_type="hardcoded_secret",
+                severity="high",
+                title="Hardcoded production secret key",
+                rationale="Embedding secrets in code leaks credentials and breaks rotation.",
+            ),
+            VulnerabilitySpec(
+                id="E-03",
+                filename="app/routes.py",
+                line=15,
+                vuln_type="weak_authentication",
+                severity="medium",
+                title="Backdoor static token for login",
+                rationale="Fixed token bypasses proper authentication controls.",
+            ),
+        ],
+    ),
+    "medium": TaskSpec(
+        id="medium",
+        title="Payment Webhook Service",
+        difficulty="medium",
+        objective=(
+            "Review a webhook and export pipeline used by operations. Catch logic and "
+            "input-handling flaws that can be exploited in production."
+        ),
+        max_steps=14,
+        repository={
+            "service/webhook.py": """import hmac\nimport hashlib\nfrom flask import request, jsonify\n\nSHARED_SECRET = 'whsec_test_123'\n\ndef verify_signature(body: bytes, signature: str) -> bool:\n    digest = hmac.new(SHARED_SECRET.encode(), body, hashlib.sha256).hexdigest()\n    return digest == signature\n\ndef process_event(event: dict):\n    return {'processed': event.get('id')}\n\ndef webhook_handler():\n    body = request.data\n    signature = request.headers.get('X-Signature', '')\n    if not verify_signature(body, signature):\n        return jsonify({'error': 'invalid signature'}), 401\n\n    event = request.json or {}\n    if event.get('debug'):\n        # temporary bypass while partner migrates\n        return jsonify({'ok': True, 'bypass': True})\n\n    return jsonify(process_event(event))\n""",
+            "service/export.py": """import os\n\ndef build_export_cmd(job_name: str, output_dir: str) -> str:\n    return f\"tar -czf {output_dir}/{job_name}.tgz /srv/data/{job_name}\"\n\ndef run_export(job_name: str, output_dir: str):\n    cmd = build_export_cmd(job_name, output_dir)\n    os.system(cmd)\n    return {'status': 'ok', 'cmd': cmd}\n""",
+            "service/serializers.py": """import yaml\n\ndef load_template(raw: str):\n    return yaml.load(raw, Loader=yaml.Loader)\n""",
+        },
+        vulnerabilities=[
+            VulnerabilitySpec(
+                id="M-01",
+                filename="service/webhook.py",
+                line=10,
+                vuln_type="weak_authentication",
+                severity="medium",
+                title="Timing-unsafe signature comparison",
+                rationale="Using == for secrets can leak comparison timing signal.",
+            ),
+            VulnerabilitySpec(
+                id="M-02",
+                filename="service/webhook.py",
+                line=22,
+                vuln_type="weak_authentication",
+                severity="high",
+                title="Debug flag bypasses signature verification outcome",
+                rationale="Attacker-controlled debug field returns success without processing guards.",
+            ),
+            VulnerabilitySpec(
+                id="M-03",
+                filename="service/export.py",
+                line=8,
+                vuln_type="command_injection",
+                severity="critical",
+                title="Unsanitized shell command execution",
+                rationale="User-controlled job_name/output_dir flow into os.system command string.",
+            ),
+            VulnerabilitySpec(
+                id="M-04",
+                filename="service/serializers.py",
+                line=4,
+                vuln_type="insecure_deserialization",
+                severity="high",
+                title="Unsafe YAML loader",
+                rationale="yaml.Loader can construct arbitrary Python objects from attacker input.",
+            ),
+        ],
+    ),
+    "hard": TaskSpec(
+        id="hard",
+        title="Enterprise Multi-Tenant API",
+        difficulty="hard",
+        objective=(
+            "Audit an API gateway handling tenants, files, and callback fetches. "
+            "Find high-impact vulnerabilities without flooding false positives."
+        ),
+        max_steps=16,
+        repository={
+            "api/auth.py": """import base64\nimport json\nimport jwt\n\ndef issue_token(user_id: str, tenant_id: str):\n    payload = {'sub': user_id, 'tenant': tenant_id, 'role': 'member'}\n    return jwt.encode(payload, 'dev-key', algorithm='HS256')\n\ndef parse_token(token: str):\n    header_b64 = token.split('.')[0] + '=='\n    header = json.loads(base64.urlsafe_b64decode(header_b64).decode())\n    if header.get('alg') == 'none':\n        return json.loads(base64.urlsafe_b64decode(token.split('.')[1] + '==').decode())\n    return jwt.decode(token, 'dev-key', algorithms=['HS256'])\n""",
+            "api/files.py": """from flask import request, jsonify\n\nFILES = {\n    'tenant-a': {'1': 'a-private-doc'},\n    'tenant-b': {'2': 'b-private-doc'},\n}\n\ndef get_file(user):\n    file_id = request.args.get('file_id')\n    tenant = request.args.get('tenant')\n    data = FILES.get(tenant, {}).get(file_id)\n    if not data:\n        return jsonify({'error': 'not found'}), 404\n    return jsonify({'file': data, 'tenant': tenant, 'user': user['sub']})\n""",
+            "api/fetcher.py": """import requests\n\ndef fetch_preview(url: str):\n    response = requests.get(url, timeout=3)\n    return {'status': response.status_code, 'body': response.text[:120]}\n""",
+            "api/storage.py": """from pathlib import Path\n\nBASE = Path('/srv/uploads')\n\ndef read_attachment(path_fragment: str) -> bytes:\n    final_path = BASE / path_fragment\n    return final_path.read_bytes()\n""",
+        },
+        vulnerabilities=[
+            VulnerabilitySpec(
+                id="H-01",
+                filename="api/auth.py",
+                line=12,
+                vuln_type="weak_authentication",
+                severity="critical",
+                title="Accepts unsigned JWT tokens when alg=none",
+                rationale="Token parser trusts attacker-controlled header and bypasses signature checks.",
+            ),
+            VulnerabilitySpec(
+                id="H-02",
+                filename="api/files.py",
+                line=11,
+                vuln_type="weak_authentication",
+                severity="high",
+                title="Tenant access controlled by request parameter",
+                rationale="Requester can switch tenant query parameter and read cross-tenant data (IDOR).",
+            ),
+            VulnerabilitySpec(
+                id="H-03",
+                filename="api/fetcher.py",
+                line=4,
+                vuln_type="ssrf",
+                severity="high",
+                title="Server-side fetch of arbitrary URL",
+                rationale="Attacker can query internal metadata endpoints through backend network path.",
+            ),
+            VulnerabilitySpec(
+                id="H-04",
+                filename="api/storage.py",
+                line=6,
+                vuln_type="path_traversal",
+                severity="critical",
+                title="Unvalidated path join for file reads",
+                rationale="Path fragments containing .. can escape upload directory.",
+            ),
+        ],
+    ),
+}
+SEVERITY_WEIGHTS = {
+    "low": 1.0,
+    "medium": 2.0,
+    "high": 3.0,
+    "critical": 4.0,
+}
+TARGET_CONFIDENCE = {
+    "low": 0.55,
+    "medium": 0.65,
+    "high": 0.8,
+    "critical": 0.9,
+}
+def get_task(task_id: str) -> TaskSpec:
+    if task_id not in TASKS:
+        raise KeyError(f"Unknown task_id '{task_id}'. Available: {', '.join(sorted(TASKS))}")
+    return TASKS[task_id]
+def list_task_ids() -> List[str]:
+    return sorted(TASKS.keys())
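
The rationale strings for M-01 (timing-unsafe comparison) and H-04 (unvalidated path join) each imply a standard-library mitigation. The sketch below shows one possible hardened counterpart for each fixture, assuming Python 3.9+ for `Path.is_relative_to`; the function names are hypothetical, and nothing here is part of the task repository or the grader.

```python
import hashlib
import hmac
from pathlib import Path

SHARED_SECRET = "whsec_test_123"  # same placeholder value as the fixture
BASE = Path("/srv/uploads")


def verify_signature_constant_time(body: bytes, signature: str) -> bool:
    """Compare the HMAC digest in constant time instead of using ==."""
    digest = hmac.new(SHARED_SECRET.encode(), body, hashlib.sha256).hexdigest()
    # Encode both sides so non-ASCII input cannot raise TypeError.
    return hmac.compare_digest(digest.encode(), signature.encode())


def resolve_attachment(path_fragment: str, base: Path = BASE) -> Path:
    """Resolve a user-supplied fragment and reject anything outside `base`."""
    base_resolved = base.resolve()
    candidate = (base_resolved / path_fragment).resolve()
    if not candidate.is_relative_to(base_resolved):
        raise ValueError(f"path escapes upload directory: {path_fragment!r}")
    return candidate


good_sig = hmac.new(SHARED_SECRET.encode(), b"payload", hashlib.sha256).hexdigest()
accepted = verify_signature_constant_time(b"payload", good_sig)
rejected = verify_signature_constant_time(b"payload", "not-a-signature")

try:
    resolve_attachment("../../etc/passwd")
    traversal_blocked = False
except ValueError:
    traversal_blocked = True
```

Resolving before the containment check also covers absolute fragments such as `/etc/passwd`, which would otherwise replace the base path entirely when joined.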

tests/__pycache__/conftest.cpython-312-pytest-7.4.4.pyc ADDED

Binary file (726 Bytes).

tests/__pycache__/conftest.cpython-314-pytest-9.0.2.pyc ADDED

Binary file (739 Bytes).

tests/__pycache__/test_behavioral_scenarios.cpython-312-pytest-7.4.4.pyc ADDED

Binary file (31.4 kB).

tests/__pycache__/test_grader_and_env.cpython-312-pytest-7.4.4.pyc ADDED

Binary file (9.16 kB).

tests/__pycache__/test_grader_and_env.cpython-314-pytest-9.0.2.pyc ADDED

Binary file (10.6 kB).

tests/__pycache__/test_grader_and_env.cpython-314.pyc ADDED

Binary file (3.17 kB).

tests/conftest.py ADDED

	@@ -0,0 +1,10 @@

+from __future__ import annotations
+import sys
+from pathlib import Path
+# Make package importable when tests are run from the workspace root, e.g.:
+# python -m pytest -q OpenEnv/envs/code_security_auditor_env/tests/test_grader_and_env.py
+_ENVS_DIR = Path(__file__).resolve().parents[2]
+if str(_ENVS_DIR) not in sys.path:
+    sys.path.insert(0, str(_ENVS_DIR))

tests/test_behavioral_scenarios.py ADDED

	@@ -0,0 +1,476 @@

+from __future__ import annotations
+from typing import Iterable
+import pytest
+from pydantic import ValidationError
+from code_security_auditor_env.models import CodeSecurityAction
+from code_security_auditor_env.server.security_environment import CodeSecurityAuditorEnvironment
+def _action(**kwargs) -> CodeSecurityAction:
+    return CodeSecurityAction(**kwargs)
+def _run_actions(task_id: str, actions: Iterable[CodeSecurityAction]) -> tuple[float, list[float]]:
+    env = CodeSecurityAuditorEnvironment(default_task_id=task_id)
+    obs = env.reset(task_id=task_id)
+    rewards: list[float] = [float(obs.reward or 0.0)]
+    for action in actions:
+        obs = env.step(action)
+        rewards.append(float(obs.reward or 0.0))
+        if obs.done:
+            break
+    if not obs.done:
+        obs = env.step(_action(action_type="submit_final_report"))
+        rewards.append(float(obs.reward or 0.0))
+    return float(obs.reward or 0.0), rewards
+@pytest.mark.parametrize(
+    "task_id,expected_file_count",
+    [
+        ("easy", 3),
+        ("medium", 3),
+        ("hard", 4),
+    ],
+)
+def test_reset_exposes_task_specific_observation_space(task_id: str, expected_file_count: int) -> None:
+    env = CodeSecurityAuditorEnvironment(default_task_id=task_id)
+    obs = env.reset(task_id=task_id)
+    assert obs.task_id == task_id
+    assert len(obs.available_files) == expected_file_count
+    assert obs.steps_remaining > 0
+    assert obs.file_excerpt == ""
+    assert obs.focused_file is None
+    assert 0.0 <= float(obs.score_hint) <= 1.0
+def test_action_space_validation_rejects_invalid_values() -> None:
+    with pytest.raises(ValidationError):
+        _action(action_type="not_valid")
+    with pytest.raises(ValidationError):
+        _action(action_type="submit_finding", confidence=1.5)
+    with pytest.raises(ValidationError):
+        _action(action_type="submit_finding", line_start=0)
+def test_inspect_file_returns_numbered_excerpt() -> None:
+    env = CodeSecurityAuditorEnvironment(default_task_id="easy")
+    env.reset(task_id="easy")
+    obs = env.step(_action(action_type="inspect_file", filename="app/routes.py"))
+    assert obs.focused_file == "app/routes.py"
+    assert "  1:" in obs.file_excerpt
+    assert "SELECT id, email, role" in obs.file_excerpt
+def test_partial_progress_reward_for_near_miss_finding() -> None:
+    env = CodeSecurityAuditorEnvironment(default_task_id="easy")
+    env.reset(task_id="easy")
+    obs = env.step(
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=11,
+            line_end=11,
+            vuln_type="sql_injection",
+            severity="high",
+            confidence=0.8,
+            evidence="nearby SQL line",
+            summary="line slightly off",
+        )
+    )
+    assert 0.0 < float(obs.reward or 0.0) <= 0.2
+    assert "Partial progress" in obs.last_feedback
+def test_easy_task_high_quality_trajectory_scores_high() -> None:
+    actions = [
+        _action(action_type="inspect_file", filename="app/routes.py"),
+        _action(action_type="inspect_file", filename="app/config.py"),
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=8,
+            vuln_type="sql_injection",
+            severity="high",
+            confidence=0.8,
+            evidence="f-string SQL query with request arg",
+            summary="SQL injection",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="app/config.py",
+            line_start=5,
+            vuln_type="hardcoded_secret",
+            severity="high",
+            confidence=0.85,
+            evidence="secret embedded in config",
+            summary="hardcoded secret",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=15,
+            vuln_type="weak_authentication",
+            severity="medium",
+            confidence=0.65,
+            evidence="static token auth bypass",
+            summary="weak authentication",
+        ),
+        _action(action_type="submit_final_report"),
+    ]
+    score, rewards = _run_actions("easy", actions)
+    assert score >= 0.75
+    assert all(0.0 <= r <= 1.0 for r in rewards)
+def test_reward_hacking_by_spam_and_duplicates_is_penalized() -> None:
+    strong_actions = [
+        _action(action_type="inspect_file", filename="app/routes.py"),
+        _action(action_type="inspect_file", filename="app/config.py"),
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=8,
+            vuln_type="sql_injection",
+            severity="high",
+            confidence=0.8,
+            evidence="sql injection",
+            summary="sql injection",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="app/config.py",
+            line_start=5,
+            vuln_type="hardcoded_secret",
+            severity="high",
+            confidence=0.85,
+            evidence="hardcoded secret",
+            summary="hardcoded secret",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=15,
+            vuln_type="weak_authentication",
+            severity="medium",
+            confidence=0.65,
+            evidence="static token",
+            summary="weak auth",
+        ),
+        _action(action_type="submit_final_report"),
+    ]
+    spam_actions = [
+        _action(action_type="inspect_file", filename="app/routes.py"),
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=8,
+            vuln_type="sql_injection",
+            severity="high",
+            confidence=0.8,
+            evidence="sql injection",
+            summary="sql injection",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=8,
+            vuln_type="sql_injection",
+            severity="high",
+            confidence=0.95,
+            evidence="duplicate #1",
+            summary="duplicate #1",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=8,
+            vuln_type="sql_injection",
+            severity="high",
+            confidence=0.99,
+            evidence="duplicate #2",
+            summary="duplicate #2",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=2,
+            vuln_type="xss",
+            severity="critical",
+            confidence=1.0,
+            evidence="intentional false positive",
+            summary="intentional false positive",
+        ),
+        _action(action_type="submit_final_report"),
+    ]
+    strong_score, _ = _run_actions("easy", strong_actions)
+    spam_score, _ = _run_actions("easy", spam_actions)
+    assert strong_score > spam_score
+    assert spam_score < 0.6
+
+
+def test_medium_and_hard_tasks_support_successful_completion() -> None:
+    medium_actions = [
+        _action(action_type="inspect_file", filename="service/webhook.py"),
+        _action(action_type="inspect_file", filename="service/export.py"),
+        _action(action_type="inspect_file", filename="service/serializers.py"),
+        _action(
+            action_type="submit_finding",
+            filename="service/webhook.py",
+            line_start=10,
+            vuln_type="weak_authentication",
+            severity="medium",
+            confidence=0.65,
+            evidence="timing unsafe compare",
+            summary="signature compare",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="service/webhook.py",
+            line_start=22,
+            vuln_type="weak_authentication",
+            severity="high",
+            confidence=0.8,
+            evidence="debug bypass",
+            summary="debug bypass",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="service/export.py",
+            line_start=8,
+            vuln_type="command_injection",
+            severity="critical",
+            confidence=0.92,
+            evidence="os.system with user input",
+            summary="command injection",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="service/serializers.py",
+            line_start=4,
+            vuln_type="insecure_deserialization",
+            severity="high",
+            confidence=0.83,
+            evidence="yaml.Loader unsafe",
+            summary="unsafe yaml load",
+        ),
+        _action(action_type="submit_final_report"),
+    ]
+    hard_actions = [
+        _action(action_type="inspect_file", filename="api/auth.py"),
+        _action(action_type="inspect_file", filename="api/files.py"),
+        _action(action_type="inspect_file", filename="api/fetcher.py"),
+        _action(action_type="inspect_file", filename="api/storage.py"),
+        _action(
+            action_type="submit_finding",
+            filename="api/auth.py",
+            line_start=12,
+            vuln_type="weak_authentication",
+            severity="critical",
+            confidence=0.9,
+            evidence="alg=none token acceptance",
+            summary="jwt none alg",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="api/files.py",
+            line_start=11,
+            vuln_type="weak_authentication",
+            severity="high",
+            confidence=0.8,
+            evidence="tenant param controls authorization",
+            summary="idor cross tenant",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="api/fetcher.py",
+            line_start=4,
+            vuln_type="ssrf",
+            severity="high",
+            confidence=0.8,
+            evidence="requests.get arbitrary URL",
+            summary="ssrf",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="api/storage.py",
+            line_start=6,
+            vuln_type="path_traversal",
+            severity="critical",
+            confidence=0.9,
+            evidence="path join without normalization",
+            summary="path traversal",
+        ),
+        _action(action_type="submit_final_report"),
+    ]
+    medium_score, medium_rewards = _run_actions("medium", medium_actions)
+    hard_score, hard_rewards = _run_actions("hard", hard_actions)
+    assert medium_score >= 0.7
+    assert hard_score >= 0.7
+    assert all(0.0 <= r <= 1.0 for r in medium_rewards)
+    assert all(0.0 <= r <= 1.0 for r in hard_rewards)
+
+
+def test_confidence_miscalibration_reduces_partial_progress_rewards() -> None:
+    # Use line offsets that produce partial (not confirmed) matches, so that
+    # confidence calibration affects the component score and therefore the shaped reward.
+    overconfident_actions = [
+        _action(action_type="inspect_file", filename="app/routes.py"),
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=13,
+            vuln_type="sql_injection",
+            severity="high",
+            confidence=1.0,
+            evidence="near miss with inflated confidence #1",
+            summary="near miss #1",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="app/config.py",
+            line_start=1,
+            vuln_type="hardcoded_secret",
+            severity="high",
+            confidence=1.0,
+            evidence="near miss with inflated confidence #2",
+            summary="near miss #2",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=20,
+            vuln_type="weak_authentication",
+            severity="medium",
+            confidence=1.0,
+            evidence="near miss with inflated confidence #3",
+            summary="near miss #3",
+        ),
+        _action(action_type="submit_final_report"),
+    ]
+    calibrated_actions = [
+        _action(action_type="inspect_file", filename="app/routes.py"),
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=13,
+            vuln_type="sql_injection",
+            severity="high",
+            confidence=0.8,
+            evidence="near miss with calibrated confidence #1",
+            summary="near miss #1",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="app/config.py",
+            line_start=1,
+            vuln_type="hardcoded_secret",
+            severity="high",
+            confidence=0.8,
+            evidence="near miss with calibrated confidence #2",
+            summary="near miss #2",
+        ),
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=20,
+            vuln_type="weak_authentication",
+            severity="medium",
+            confidence=0.65,
+            evidence="near miss with calibrated confidence #3",
+            summary="near miss #3",
+        ),
+        _action(action_type="submit_final_report"),
+    ]
+    overconf_score, overconf_rewards = _run_actions("easy", overconfident_actions)
+    calibrated_score, calibrated_rewards = _run_actions("easy", calibrated_actions)
+    assert sum(calibrated_rewards) > sum(overconf_rewards)
+    assert calibrated_score >= overconf_score
+
+
+def test_step_limit_stalling_strategy_auto_finalizes_with_low_score() -> None:
+    env = CodeSecurityAuditorEnvironment(default_task_id="easy")
+    obs = env.reset(task_id="easy")
+    # Repeatedly inspect the same non-critical file to simulate stalling.
+    while not obs.done:
+        obs = env.step(_action(action_type="inspect_file", filename="app/db.py"))
+    assert obs.done is True
+    assert 0.0 <= float(obs.reward or 0.0) <= 1.0
+    assert float(obs.reward or 0.0) < 0.5
+    assert "Max steps reached" in obs.last_feedback
+
+
+def test_repeated_duplicate_confirmed_findings_reduce_quality_multiplier() -> None:
+    env = CodeSecurityAuditorEnvironment(default_task_id="easy")
+    env.reset(task_id="easy")
+    first = env.step(
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=8,
+            vuln_type="sql_injection",
+            severity="high",
+            confidence=0.8,
+            evidence="correct first finding",
+            summary="correct first finding",
+        )
+    )
+    qm_after_first = float(first.metadata["quality_multiplier"])
+    second = env.step(
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=8,
+            vuln_type="sql_injection",
+            severity="high",
+            confidence=0.95,
+            evidence="duplicate second",
+            summary="duplicate second",
+        )
+    )
+    qm_after_second = float(second.metadata["quality_multiplier"])
+    third = env.step(
+        _action(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=8,
+            vuln_type="sql_injection",
+            severity="high",
+            confidence=1.0,
+            evidence="duplicate third",
+            summary="duplicate third",
+        )
+    )
+    qm_after_third = float(third.metadata["quality_multiplier"])
+    assert qm_after_second < qm_after_first
+    assert qm_after_third < qm_after_second
+    assert int(third.metadata["duplicate_submission_count"]) >= 2
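
The last test above asserts that each repeated submission of an already-confirmed finding lowers `quality_multiplier`. The grader's actual formula is not visible in this part of the diff, so the following is only a minimal sketch of one way such a penalty could behave. The function name `quality_multiplier`, the `penalty_per_duplicate` rate, and the `floor` value are assumptions made for illustration, not the environment's implementation.

```python
def quality_multiplier(
    duplicate_count: int,
    penalty_per_duplicate: float = 0.15,  # assumed rate, not taken from the grader
    floor: float = 0.2,                   # assumed lower bound, not taken from the grader
) -> float:
    """Hypothetical multiplier that shrinks with every duplicate submission."""
    if duplicate_count < 0:
        raise ValueError("duplicate_count must be non-negative")
    # Linear decay, clamped so the multiplier never drops below the floor.
    return max(floor, 1.0 - penalty_per_duplicate * duplicate_count)


# Zero, one, and two duplicates give a strictly decreasing multiplier,
# which is the property the test checks on the real environment.
values = [quality_multiplier(n) for n in range(3)]
assert values[0] > values[1] > values[2]
```

A clamped linear decay is only one option; any strictly decreasing function of the duplicate count would satisfy the same assertions.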

tests/test_grader_and_env.py ADDED

	@@ -0,0 +1,63 @@

+from __future__ import annotations
+from code_security_auditor_env.models import CodeSecurityAction
+from code_security_auditor_env.server.grader import evaluate_finding
+from code_security_auditor_env.server.security_environment import CodeSecurityAuditorEnvironment
+from code_security_auditor_env.server.tasks import get_task
+
+
+def test_grader_deterministic_easy_match() -> None:
+    task = get_task("easy")
+    first = task.vulnerabilities[0]
+    eval_a = evaluate_finding(
+        task=task,
+        filename=first.filename,
+        vuln_type=first.vuln_type,
+        severity=first.severity,
+        line_start=first.line,
+        line_end=first.line,
+        confidence=0.8,
+        matched_already=[],
+    )
+    eval_b = evaluate_finding(
+        task=task,
+        filename=first.filename,
+        vuln_type=first.vuln_type,
+        severity=first.severity,
+        line_start=first.line,
+        line_end=first.line,
+        confidence=0.8,
+        matched_already=[],
+    )
+    assert eval_a == eval_b
+    assert eval_a.is_confirmed_match
+    assert 0.0 <= eval_a.component_score <= 1.0
+
+
+def test_env_final_score_in_unit_interval() -> None:
+    env = CodeSecurityAuditorEnvironment(default_task_id="easy")
+    obs = env.reset(task_id="easy")
+    assert obs.task_id == "easy"
+    obs = env.step(CodeSecurityAction(action_type="inspect_file", filename="app/routes.py"))
+    assert 0.0 <= float(obs.reward or 0.0) <= 1.0
+    obs = env.step(
+        CodeSecurityAction(
+            action_type="submit_finding",
+            filename="app/routes.py",
+            line_start=8,
+            vuln_type="sql_injection",
+            severity="high",
+            confidence=0.85,
+            evidence="user id interpolated in SQL",
+            summary="SQL injection in get_user",
+        )
+    )
+    assert 0.0 <= float(obs.reward or 0.0) <= 1.0
+    obs = env.step(CodeSecurityAction(action_type="submit_final_report"))
+    assert obs.done is True
+    assert 0.0 <= float(obs.reward or 0.0) <= 1.0
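
Both test modules repeat the expression `0.0 <= float(obs.reward or 0.0) <= 1.0`, which treats a missing reward as zero before checking the bounds. As a rough sketch, that check could be captured in a small helper; `reward_in_unit_interval` is a hypothetical name and is not part of the uploaded files.

```python
from typing import Optional


def reward_in_unit_interval(reward: Optional[float]) -> bool:
    """Return True when a reward, with None treated as 0.0, lies in [0, 1]."""
    # Mirrors the tests: `reward or 0.0` maps None (and 0) to 0.0.
    value = float(reward or 0.0)
    return 0.0 <= value <= 1.0


# Usage mirroring the assertions in the tests above.
assert reward_in_unit_interval(None)
assert reward_in_unit_interval(0.85)
assert not reward_in_unit_interval(1.2)
```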

uv.lock ADDED

The diff for this file is too large to render.

validate-submission.sh ADDED

	@@ -0,0 +1,145 @@

+#!/usr/bin/env bash
+set -uo pipefail
+DOCKER_BUILD_TIMEOUT=600
+if [ -t 1 ]; then
+  RED='\033[0;31m'
+  GREEN='\033[0;32m'
+  YELLOW='\033[1;33m'
+  BOLD='\033[1m'
+  NC='\033[0m'
+else
+  RED='' GREEN='' YELLOW='' BOLD='' NC=''
+fi
+run_with_timeout() {
+  local secs="$1"; shift
+  if command -v timeout &>/dev/null; then
+    timeout "$secs" "$@"
+  elif command -v gtimeout &>/dev/null; then
+    gtimeout "$secs" "$@"
+  else
+    "$@" &
+    local pid=$!
+    ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
+    local watcher=$!
+    wait "$pid" 2>/dev/null
+    local rc=$?
+    kill "$watcher" 2>/dev/null
+    wait "$watcher" 2>/dev/null
+    return $rc
+  fi
+}
+portable_mktemp() {
+  local prefix="${1:-validate}"
+  mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
+}
+CLEANUP_FILES=()
+cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
+trap cleanup EXIT
+PING_URL="${1:-}"
+REPO_DIR="${2:-.}"
+if [ -z "$PING_URL" ]; then
+  printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
+  exit 1
+fi
+if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
+  printf "Error: directory '%s' not found\n" "${2:-.}"
+  exit 1
+fi
+PING_URL="${PING_URL%/}"
+PASS=0
+log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
+pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
+fail() { log "${RED}FAILED${NC} -- $1"; }
+hint() { printf "  ${YELLOW}Hint:${NC} %b\n" "$1"; }
+stop_at() {
+  printf "\n"
+  printf "${RED}${BOLD}Validation stopped at %s.${NC}\n" "$1"
+  exit 1
+}
+printf "\n${BOLD}========================================${NC}\n"
+printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+log "Repo:     $REPO_DIR"
+log "Ping URL: $PING_URL"
+printf "\n"
+log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
+CURL_OUTPUT=$(portable_mktemp "validate-curl")
+CLEANUP_FILES+=("$CURL_OUTPUT")
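+# curl's -w already prints 000 when no response arrives, so the fallback is
+# assigned outside the substitution to avoid producing "000000".
+HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
+  -H "Content-Type: application/json" -d '{}' \
+  "$PING_URL/reset" --max-time 30 2>/dev/null) || HTTP_CODE="000"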
+if [ "$HTTP_CODE" = "200" ]; then
+  pass "HF Space is live and responds to /reset"
+elif [ "$HTTP_CODE" = "000" ]; then
+  fail "HF Space not reachable"
+  hint "Check the Space URL and runtime status."
+  stop_at "Step 1"
+else
+  fail "HF Space /reset returned HTTP $HTTP_CODE"
+  hint "Make sure the app is healthy and running on app_port=8000."
+  stop_at "Step 1"
+fi
+log "${BOLD}Step 2/3: Running docker build${NC} ..."
+if ! command -v docker &>/dev/null; then
+  fail "docker command not found"
+  hint "Install Docker first."
+  stop_at "Step 2"
+fi
+if [ -f "$REPO_DIR/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR"
+elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR/server"
+else
+  fail "No Dockerfile found in root or server/"
+  stop_at "Step 2"
+fi
+BUILD_OK=false
+BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
+if [ "$BUILD_OK" = true ]; then
+  pass "Docker build succeeded"
+else
+  fail "Docker build failed"
+  printf "%s\n" "$BUILD_OUTPUT" | tail -20
+  stop_at "Step 2"
+fi
+log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
+if ! command -v openenv &>/dev/null; then
+  fail "openenv command not found"
+  hint "Install it with: pip install openenv-core"
+  stop_at "Step 3"
+fi
+VALIDATE_OK=false
+VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
+if [ "$VALIDATE_OK" = true ]; then
+  pass "openenv validate passed"
+else
+  fail "openenv validate failed"
+  printf "%s\n" "$VALIDATE_OUTPUT"
+  stop_at "Step 3"
+fi
+printf "\n${BOLD}========================================${NC}\n"
+printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
+printf "${BOLD}========================================${NC}\n\n"