---
title: Code Security Auditor Environment
emoji: "🛡️"
colorFrom: yellow
colorTo: red
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - security
  - code-review
  - reinforcement-learning
---

# Code Security Auditor Environment

A real-world OpenEnv benchmark where agents perform security auditing on pull-request style code snapshots.

The agent inspects files, submits vulnerability findings, and finalizes a report. The environment scores each episode with deterministic graders against hidden vulnerability ground truth, awarding partial credit and applying anti-reward-hacking penalties.

## Why this is a real-world task

Security reviewers and AppSec engineers routinely audit code for vulnerabilities before deployment. This environment models that workflow with concrete exploit classes:

- SQL injection
- command injection
- insecure deserialization
- weak authentication / auth bypass
- SSRF
- path traversal
- hardcoded secrets

## OpenEnv Compliance

- Typed models: CodeSecurityAction, CodeSecurityObservation, CodeSecurityState
- Core API: reset(), step(), state()
- OpenEnv manifest: openenv.yaml
- FastAPI runtime via server.app:app

## Action Space

Action model: CodeSecurityAction

- action_type: inspect_file | submit_finding | submit_final_report
- filename: target file to inspect or report against
- line_start, line_end: suspected vulnerable range
- vuln_type: one of supported vulnerability classes
- severity: low | medium | high | critical
- confidence: [0.0, 1.0]
- evidence, summary: free-form context

### Action semantics

- inspect_file: returns full line-numbered file content.
- submit_finding: grades the finding with deterministic partial credit.
- submit_final_report: ends the episode and returns final score in [0.0, 1.0].
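
As a rough illustration of how the three action types might be populated, the sketch below uses a hypothetical `ActionSketch` dataclass that mirrors the field names listed above. It is not the real CodeSecurityAction model, which may validate differently. The filename `app.py`, the line numbers, and the `vuln_type` string `sql_injection` are assumptions made for the example.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Allowed values taken from the Action Space list above.
ACTION_TYPES = {"inspect_file", "submit_finding", "submit_final_report"}
SEVERITIES = {"low", "medium", "high", "critical"}


@dataclass
class ActionSketch:
    """Hypothetical stand-in for CodeSecurityAction (illustration only)."""

    action_type: str
    filename: Optional[str] = None
    line_start: Optional[int] = None
    line_end: Optional[int] = None
    vuln_type: Optional[str] = None
    severity: Optional[str] = None
    confidence: Optional[float] = None
    evidence: str = ""
    summary: str = ""

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges.
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type}")
        if self.severity is not None and self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0.0, 1.0]")
        if (
            self.line_start is not None
            and self.line_end is not None
            and self.line_end < self.line_start
        ):
            raise ValueError("line_end must not precede line_start")


# 1. Look at a file (assumed filename).
inspect_action = ActionSketch(action_type="inspect_file", filename="app.py")

# 2. Report a suspected vulnerability; the vuln_type string is an assumption.
finding_action = ActionSketch(
    action_type="submit_finding",
    filename="app.py",
    line_start=42,
    line_end=44,
    vuln_type="sql_injection",
    severity="high",
    confidence=0.8,
    evidence="User input is concatenated into the SQL string.",
    summary="Possible SQL injection in query construction.",
)

# 3. End the episode.
final_action = ActionSketch(action_type="submit_final_report")

# asdict() gives a plain dict that could be serialized as a JSON payload.
payload = asdict(finding_action)
```

Keeping the range checks in `__post_init__` means a malformed finding fails before it is sent, rather than spending one of the episode's limited steps.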
## Observation Space

Observation model: CodeSecurityObservation

Key fields:

- task_id, task_title, difficulty, objective
- available_files
- focused_file, file_excerpt
- findings_so_far
- steps_remaining
- last_feedback
- score_hint in [0, 1]
- reward, done, metadata

## Tasks and Difficulty

The environment includes 3 deterministic tasks:

1. easy: Legacy Flask Patch Review
2. medium: Payment Webhook Service
3. hard: Enterprise Multi-Tenant API

Each task has:

- realistic multi-file code snapshot
- hidden vulnerability ground truth
- deterministic grader with score in [0.0, 1.0]

## Reward Design

Reward shaping is trajectory-aware and resistant to reward hacking:

- inspect_file gives small positive signal for novel, relevant file exploration
- submit_finding gives partial credit ladder (file -> type -> line -> severity -> confidence calibration)
- duplicate/low-quality findings reduce quality_multiplier and final score
- false positives and over-submission reduce precision and final score
- final score combines weighted recall, precision, structural quality, and calibration

This creates control and symmetry: spamming findings can increase step count but lowers precision and quality, preventing easy reward exploitation.

## Baseline Scores

With deterministic tasks and a simple tool-using model loop, expected baseline tendencies are:

- easy: high recall, moderate precision
- medium: moderate recall, moderate precision
- hard: lower recall, stricter penalties for noisy findings

Run inference.py to generate reproducible per-task scores for your selected model setup.

## Setup

### Option A: Run in-repo (OpenEnv monorepo)

From repository root:

```bash
docker build -t code-security-auditor-env:latest -f envs/code_security_auditor_env/server/Dockerfile .
docker run -p 8000:8000 code-security-auditor-env:latest
```

### Option B: Run standalone

From this directory:

```bash
docker build -t code-security-auditor-env:latest .
docker run -p 8000:8000 code-security-auditor-env:latest
```

## Baseline Inference

The required script is inference.py in project root (this directory).

Required env vars:

- API_BASE_URL
- MODEL_NAME
- HF_TOKEN

Optional env vars:

- LOCAL_IMAGE_NAME (for from_docker_image mode)
- ENV_BASE_URL (for connecting to an already-running server)
- TASK_IDS (comma-separated task ids, default: easy,medium,hard)
- MAX_STEPS

Run:

```bash
export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export LOCAL_IMAGE_NAME=code-security-auditor-env:latest
python inference.py
```

The script prints only [START], [STEP], and [END] log lines per task.

## Hugging Face Spaces Deployment

Space repository:

- https://huggingface.co/spaces/Drac0528/CodeSecure

Recommended deploy flow (git push to Space repo):

```bash
git clone https://huggingface.co/spaces/Drac0528/CodeSecure
cd CodeSecure
cp -R /path/to/code_security_auditor_env/* .
rm -f .env
git add .
git commit -m "Deploy Code Security Auditor OpenEnv"
git push
```

Notes:

- Keep README frontmatter and Dockerfile at Space repo root.
- Use Space Settings to set runtime secrets/variables:
  - HF_TOKEN (Secret)
  - API_BASE_URL (Variable)
  - MODEL_NAME (Variable)
- Ensure Space tags include `openenv`.

Verify API endpoint after build:

```bash
curl -X POST https://drac0528-codesecure.hf.space/reset -H 'Content-Type: application/json' -d '{}'
```

## Validation

Use validate-submission.sh before submitting:

```bash
chmod +x validate-submission.sh
./validate-submission.sh https://drac0528-codesecure.hf.space .
```
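
For a quick manual check alongside the validation script, the `/reset` request used to verify the Space can also be issued from Python. The sketch below uses only the standard library and sends the same empty JSON body as the curl command. The helper names `build_reset_request` and `check_reset` are hypothetical, and it assumes the endpoint replies with a JSON body; the response fields are not specified here.

```python
import json
import urllib.request


def build_reset_request(base_url: str) -> urllib.request.Request:
    """Build a POST {base_url}/reset request with an empty JSON body."""
    url = base_url.rstrip("/") + "/reset"
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_reset(base_url: str, timeout: float = 30.0) -> dict:
    """Send the reset request and decode the reply (assumed to be JSON)."""
    request = build_reset_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage against a locally running container (port from the Setup section):
#   observation = check_reset("http://localhost:8000")
#   print(observation)
```

Splitting request construction from the network call keeps the URL and header logic checkable without a running server.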