---
title: Code Security Auditor Environment
emoji: "🛡️"
colorFrom: yellow
colorTo: red
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
- security
- code-review
- reinforcement-learning
---
# Code Security Auditor Environment
A real-world OpenEnv benchmark where agents perform security auditing on pull-request-style code snapshots.
The agent inspects files, submits vulnerability findings, and finalizes a report. Deterministic graders score each episode against hidden vulnerability ground truth, awarding partial credit and applying penalties that discourage reward hacking.
## Why this is a real-world task
Security reviewers and AppSec engineers routinely audit code for vulnerabilities before deployment. This environment models that workflow with concrete exploit classes:
- SQL injection
- command injection
- insecure deserialization
- weak authentication / auth bypass
- SSRF
- path traversal
- hardcoded secrets
## OpenEnv Compliance
- Typed models: CodeSecurityAction, CodeSecurityObservation, CodeSecurityState
- Core API: reset(), step(), state()
- OpenEnv manifest: openenv.yaml
- FastAPI runtime via server.app:app
## Action Space
Action model: CodeSecurityAction
- action_type: inspect_file | submit_finding | submit_final_report
- filename: target file to inspect or report against
- line_start, line_end: suspected vulnerable range
- vuln_type: one of supported vulnerability classes
- severity: low | medium | high | critical
- confidence: [0.0, 1.0]
- evidence, summary: free-form context
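The field list above can be mirrored in a small client-side check before an action is sent. This is a minimal sketch, not the real CodeSecurityAction class: `ActionSketch` and `validate` are hypothetical names, the rule about which fields each action type needs is an assumption, and the `vuln_type` identifiers are not validated because the README does not list them.

```python
from dataclasses import dataclass
from typing import List, Optional

ACTION_TYPES = {"inspect_file", "submit_finding", "submit_final_report"}
SEVERITIES = {"low", "medium", "high", "critical"}


@dataclass
class ActionSketch:
    """Hypothetical stand-in for CodeSecurityAction (not the real model)."""

    action_type: str
    filename: Optional[str] = None
    line_start: Optional[int] = None
    line_end: Optional[int] = None
    vuln_type: Optional[str] = None
    severity: Optional[str] = None
    confidence: Optional[float] = None
    evidence: str = ""
    summary: str = ""

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the action looks well-formed."""
        problems: List[str] = []
        if self.action_type not in ACTION_TYPES:
            problems.append(f"unknown action_type: {self.action_type!r}")
        # Assumption: inspecting and reporting both need a target file.
        if self.action_type in {"inspect_file", "submit_finding"} and not self.filename:
            problems.append("filename is required")
        if self.action_type == "submit_finding":
            if self.line_start is None or self.line_end is None:
                problems.append("line_start and line_end are required")
            elif self.line_start > self.line_end:
                problems.append("line_start must not exceed line_end")
            if self.severity not in SEVERITIES:
                problems.append(f"unknown severity: {self.severity!r}")
            if self.confidence is None or not 0.0 <= self.confidence <= 1.0:
                problems.append("confidence must be in [0.0, 1.0]")
        return problems
```

Checking locally avoids spending a step on an action the server would likely reject; the server-side model remains the source of truth.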
### Action semantics
- inspect_file: returns full line-numbered file content.
- submit_finding: grades the finding with deterministic partial credit.
- submit_final_report: ends the episode and returns final score in [0.0, 1.0].
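Because inspect_file returns line-numbered content, an agent can read line_start and line_end directly off the excerpt. The sketch below shows one way such numbering could be produced; the `N: text` layout and the `number_lines` helper are assumptions for illustration, and the environment's actual excerpt format may differ.

```python
def number_lines(source: str, start: int = 1) -> str:
    """Prefix each line with a right-aligned 1-based line number."""
    lines = source.splitlines()
    if not lines:
        return ""
    # Pad to the width of the largest line number so columns stay aligned.
    width = len(str(start + len(lines) - 1))
    return "\n".join(
        f"{number:>{width}}: {line}" for number, line in enumerate(lines, start)
    )
```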
## Observation Space
Observation model: CodeSecurityObservation
Key fields:
- task_id, task_title, difficulty, objective
- available_files
- focused_file, file_excerpt
- findings_so_far
- steps_remaining
- last_feedback
- score_hint in [0, 1]
- reward, done, metadata
## Tasks and Difficulty
The environment includes 3 deterministic tasks:
1. easy: Legacy Flask Patch Review
2. medium: Payment Webhook Service
3. hard: Enterprise Multi-Tenant API
Each task has:
- realistic multi-file code snapshot
- hidden vulnerability ground truth
- deterministic grader with score in [0.0, 1.0]
## Reward Design
Reward shaping is trajectory-aware and resistant to reward hacking:
- inspect_file gives a small positive signal for novel, relevant file exploration
- submit_finding gives partial credit along a ladder (file -> type -> line -> severity -> confidence calibration)
- duplicate/low-quality findings reduce quality_multiplier and final score
- false positives and over-submission reduce precision and final score
- final score combines weighted recall, precision, structural quality, and calibration
These terms pull against each other: spamming findings adds steps but lowers precision and quality, so over-submission cannot be used to inflate the reward.
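The reward terms above can be sketched as two small functions: a per-finding ladder and a weighted final score. Everything numeric here is an assumption made for illustration (the rung weights, the overlap rule for line ranges, the final-score weights), and `ladder_credit`, `precision_recall`, and `final_score` are hypothetical names rather than the environment's grader.

```python
from typing import Dict, Tuple

# Hypothetical rung weights for file, type, line, severity. The remaining
# 0.1 is reserved for confidence calibration. Real values may differ.
RUNG_WEIGHTS: Tuple[float, ...] = (0.2, 0.3, 0.3, 0.1)
CALIBRATION_WEIGHT = 0.1


def ladder_credit(finding: Dict, truth: Dict) -> float:
    """Climb file -> type -> line -> severity and stop at the first miss."""
    checks = (
        finding["filename"] == truth["filename"],
        finding["vuln_type"] == truth["vuln_type"],
        # Assumption: line ranges match when they overlap at all.
        finding["line_start"] <= truth["line_end"]
        and truth["line_start"] <= finding["line_end"],
        finding["severity"] == truth["severity"],
    )
    credit = 0.0
    for weight, matched in zip(RUNG_WEIGHTS, checks):
        if not matched:
            return round(credit, 4)
        credit += weight
    # Every rung matched: scale the last share by the stated confidence.
    credit += CALIBRATION_WEIGHT * finding["confidence"]
    return round(credit, 4)


def precision_recall(correct: int, submitted: int, total_true: int) -> Tuple[float, float]:
    """Return (precision, recall), treating empty denominators as 0.0."""
    precision = correct / submitted if submitted else 0.0
    recall = correct / total_true if total_true else 0.0
    return precision, recall


def final_score(
    recall: float,
    precision: float,
    quality: float,
    calibration: float,
    weights: Tuple[float, float, float, float] = (0.5, 0.3, 0.1, 0.1),
) -> float:
    """Weighted sum of the four components, clamped to [0.0, 1.0]."""
    w_recall, w_precision, w_quality, w_calibration = weights
    score = (
        w_recall * recall
        + w_precision * precision
        + w_quality * quality
        + w_calibration * calibration
    )
    return max(0.0, min(1.0, score))
```

Under these assumed weights, an agent that submits 10 findings to cover 2 true vulnerabilities has the same recall as one that submits exactly 2, but its precision drops from 1.0 to 0.2, so its final score is lower.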
## Baseline Scores
With deterministic tasks and a simple tool-using model loop, expected baseline tendencies are:
- easy: high recall, moderate precision
- medium: moderate recall, moderate precision
- hard: lower recall, stricter penalties for noisy findings
Run inference.py to generate reproducible per-task scores for your selected model setup.
## Setup
### Option A: Run in-repo (OpenEnv monorepo)
From repository root:
```bash
docker build -t code-security-auditor-env:latest -f envs/code_security_auditor_env/server/Dockerfile .
docker run -p 8000:8000 code-security-auditor-env:latest
```
### Option B: Run standalone
From this directory:
```bash
docker build -t code-security-auditor-env:latest .
docker run -p 8000:8000 code-security-auditor-env:latest
```
## Baseline Inference
The required script is inference.py, located in the project root (this directory).
Required env vars:
- API_BASE_URL
- MODEL_NAME
- HF_TOKEN
Optional env vars:
- LOCAL_IMAGE_NAME (for from_docker_image mode)
- ENV_BASE_URL (for connecting to an already-running server)
- TASK_IDS (comma-separated task ids, default: easy,medium,hard)
- MAX_STEPS
Run:
```bash
export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export LOCAL_IMAGE_NAME=code-security-auditor-env:latest
python inference.py
```
The script prints only [START], [STEP], and [END] log lines per task.
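The environment variables listed above could be read and checked up front, so a missing token fails fast instead of partway through a task. This is a sketch only: `InferenceConfig` and `load_config` are hypothetical names and are not taken from inference.py, and since no default for MAX_STEPS is documented, it is left as `None` when unset.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


@dataclass
class InferenceConfig:
    """Hypothetical container for the documented environment variables."""

    api_base_url: str
    model_name: str
    hf_token: str
    local_image_name: Optional[str]
    env_base_url: Optional[str]
    task_ids: List[str]
    max_steps: Optional[int]


def load_config(env: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Build a config from env vars, raising if a required one is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))

    # TASK_IDS is comma-separated; the documented default is easy,medium,hard.
    raw_tasks = env.get("TASK_IDS", "easy,medium,hard")
    task_ids = [task.strip() for task in raw_tasks.split(",") if task.strip()]

    raw_max_steps = env.get("MAX_STEPS")
    return InferenceConfig(
        api_base_url=env["API_BASE_URL"],
        model_name=env["MODEL_NAME"],
        hf_token=env["HF_TOKEN"],
        local_image_name=env.get("LOCAL_IMAGE_NAME"),
        env_base_url=env.get("ENV_BASE_URL"),
        task_ids=task_ids,
        max_steps=int(raw_max_steps) if raw_max_steps else None,
    )
```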
## Hugging Face Spaces Deployment
Space repository:
- https://huggingface.co/spaces/Drac0528/CodeSecure
Recommended deploy flow (git push to Space repo):
```bash
git clone https://huggingface.co/spaces/Drac0528/CodeSecure
cd CodeSecure
cp -R /path/to/code_security_auditor_env/* .
rm -f .env
git add .
git commit -m "Deploy Code Security Auditor OpenEnv"
git push
```
Notes:
- Keep README frontmatter and Dockerfile at Space repo root.
- Use Space Settings to set runtime secrets/variables:
  - HF_TOKEN (Secret)
  - API_BASE_URL (Variable)
  - MODEL_NAME (Variable)
- Ensure Space tags include `openenv`.
Verify API endpoint after build:
```bash
curl -X POST https://drac0528-codesecure.hf.space/reset -H 'Content-Type: application/json' -d '{}'
```
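The same check can be scripted with the Python standard library, for example as part of a post-deploy smoke test. The helper names `build_reset_request` and `reset_space` are hypothetical, and the sketch assumes the endpoint answers with a JSON body, which the README does not state.

```python
import json
import urllib.request


def build_reset_request(base_url: str) -> urllib.request.Request:
    """Build the POST /reset request with an empty JSON object as the body."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/reset",
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_space(base_url: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the reply (assumed to be JSON)."""
    request = build_reset_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `reset_space("https://drac0528-codesecure.hf.space")` raises `urllib.error.HTTPError` on a non-2xx status, which makes a failed or still-building Space easy to detect.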
## Validation
Use validate-submission.sh before submitting:
```bash
chmod +x validate-submission.sh
./validate-submission.sh https://drac0528-codesecure.hf.space .
```