| title: Code Security Auditor Environment | |
| emoji: "🛡️" | |
| colorFrom: yellow | |
| colorTo: red | |
| sdk: docker | |
| pinned: false | |
| app_port: 8000 | |
| base_path: /web | |
| tags: | |
| - openenv | |
| - security | |
| - code-review | |
| - reinforcement-learning | |
| # Code Security Auditor Environment | |
| A real-world OpenEnv benchmark in which agents audit pull-request-style code snapshots for security vulnerabilities. | |
| The agent inspects files, submits vulnerability findings, and finalizes a report. Deterministic graders score each episode against hidden vulnerability ground truth, awarding partial credit and penalizing reward hacking. | |
| ## Why this is a real-world task | |
| Security reviewers and AppSec engineers routinely audit code for vulnerabilities before deployment. This environment models that workflow with concrete exploit classes: | |
| - SQL injection | |
| - command injection | |
| - insecure deserialization | |
| - weak authentication / auth bypass | |
| - SSRF | |
| - path traversal | |
| - hardcoded secrets | |
| ## OpenEnv Compliance | |
| - Typed models: CodeSecurityAction, CodeSecurityObservation, CodeSecurityState | |
| - Core API: reset(), step(), state() | |
| - OpenEnv manifest: openenv.yaml | |
| - FastAPI runtime via server.app:app | |
| ## Action Space | |
| Action model: CodeSecurityAction | |
| - action_type: inspect_file | submit_finding | submit_final_report | |
| - filename: target file to inspect or report against | |
| - line_start, line_end: suspected vulnerable range | |
| - vuln_type: one of supported vulnerability classes | |
| - severity: low | medium | high | critical | |
| - confidence: [0.0, 1.0] | |
| - evidence, summary: free-form context | |
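The fields above can be pictured as a small typed record. The following is a minimal sketch, not the real `CodeSecurityAction` class: `ActionSketch`, its `validate` helper, and the `sql_injection` identifier used in the usage note are hypothetical, and the validation rules are assumptions inferred from the field descriptions.

```python
from dataclasses import dataclass
from typing import List, Optional

ACTION_TYPES = {"inspect_file", "submit_finding", "submit_final_report"}
SEVERITIES = {"low", "medium", "high", "critical"}


@dataclass
class ActionSketch:
    """Hypothetical stand-in for CodeSecurityAction; not the real class."""

    action_type: str
    filename: Optional[str] = None
    line_start: Optional[int] = None
    line_end: Optional[int] = None
    vuln_type: Optional[str] = None
    severity: Optional[str] = None
    confidence: Optional[float] = None
    evidence: str = ""
    summary: str = ""

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the action looks well-formed."""
        problems = []
        if self.action_type not in ACTION_TYPES:
            problems.append(f"unknown action_type: {self.action_type!r}")
        # Assumption: both file-oriented actions need a target file.
        if self.action_type in {"inspect_file", "submit_finding"} and not self.filename:
            problems.append("filename is required")
        if self.action_type == "submit_finding":
            if self.severity not in SEVERITIES:
                problems.append(f"unknown severity: {self.severity!r}")
            if self.confidence is None or not 0.0 <= self.confidence <= 1.0:
                problems.append("confidence must be in [0.0, 1.0]")
            if (
                self.line_start is None
                or self.line_end is None
                or self.line_start > self.line_end
            ):
                problems.append("line_start/line_end must form a valid range")
        return problems
```

For example, `ActionSketch(action_type="submit_finding", filename="app.py", line_start=10, line_end=12, vuln_type="sql_injection", severity="high", confidence=0.8).validate()` returns an empty list, while a finding with `line_start` greater than `line_end` is flagged.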
| ### Action semantics | |
| - inspect_file: returns full line-numbered file content. | |
| - submit_finding: grades the finding with deterministic partial credit. | |
| - submit_final_report: ends the episode and returns final score in [0.0, 1.0]. | |
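The action semantics above can be sketched with a toy in-memory environment. `ToyAuditEnv`, its dictionary-based actions, the `max_steps` default, and the `N: line` numbering format are all assumptions made for illustration; the actual environment uses the typed models and its own graders, and this toy performs no grading at all.

```python
class ToyAuditEnv:
    """Toy stand-in illustrating the three action types; not the real environment."""

    def __init__(self, files, max_steps=5):
        self.files = files          # mapping: filename -> file content
        self.max_steps = max_steps  # hypothetical step budget
        self.steps_remaining = max_steps
        self.findings = []
        self.done = False

    def reset(self):
        """Start a fresh episode and return the initial observation."""
        self.steps_remaining = self.max_steps
        self.findings = []
        self.done = False
        return {
            "available_files": sorted(self.files),
            "steps_remaining": self.steps_remaining,
            "done": False,
        }

    def step(self, action):
        """Apply one action and return the next observation."""
        if self.done:
            raise RuntimeError("episode finished; call reset()")
        self.steps_remaining -= 1
        obs = {"file_excerpt": None, "last_feedback": ""}
        kind = action["action_type"]
        if kind == "inspect_file":
            # Return the whole file with 1-based line numbers.
            lines = self.files[action["filename"]].splitlines()
            obs["file_excerpt"] = "\n".join(
                f"{number}: {line}" for number, line in enumerate(lines, 1)
            )
        elif kind == "submit_finding":
            # The real environment grades here; the toy only records.
            self.findings.append(action)
            obs["last_feedback"] = "finding recorded"
        elif kind == "submit_final_report":
            self.done = True
        else:
            obs["last_feedback"] = "unknown action_type"
        if self.steps_remaining <= 0:
            self.done = True
        obs["steps_remaining"] = self.steps_remaining
        obs["findings_so_far"] = len(self.findings)
        obs["done"] = self.done
        return obs
```

A typical episode calls `reset()`, inspects one or more files, submits findings, and ends with `submit_final_report`; running out of steps also ends the episode in this toy.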
| ## Observation Space | |
| Observation model: CodeSecurityObservation | |
| Key fields: | |
| - task_id, task_title, difficulty, objective | |
| - available_files | |
| - focused_file, file_excerpt | |
| - findings_so_far | |
| - steps_remaining | |
| - last_feedback | |
| - score_hint in [0, 1] | |
| - reward, done, metadata | |
| ## Tasks and Difficulty | |
| The environment includes 3 deterministic tasks: | |
| 1. easy: Legacy Flask Patch Review | |
| 2. medium: Payment Webhook Service | |
| 3. hard: Enterprise Multi-Tenant API | |
| Each task has: | |
| - realistic multi-file code snapshot | |
| - hidden vulnerability ground truth | |
| - deterministic grader with score in [0.0, 1.0] | |
| ## Reward Design | |
| Reward shaping is trajectory-aware and resistant to reward hacking: | |
| - inspect_file gives small positive signal for novel, relevant file exploration | |
| - submit_finding gives partial credit ladder (file -> type -> line -> severity -> confidence calibration) | |
| - duplicate/low-quality findings reduce quality_multiplier and final score | |
| - false positives and over-submission reduce precision and final score | |
| - final score combines weighted recall, precision, structural quality, and calibration | |
| As a result, spamming findings does not pay off: extra submissions use up steps while lowering precision and quality, which prevents easy reward exploitation. | |
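To make the reward shaping concrete, here is a rough sketch of a partial-credit ladder and a final-score combination. Every number in it is an assumption: the rung weights, the recall/precision weights, and the per-duplicate penalty are invented for illustration, and the confidence-calibration and structural-quality terms are left out for brevity. The real graders may differ in all of these respects.

```python
def finding_credit(finding, truth):
    """Partial-credit ladder: a rung only counts if every earlier rung matched."""
    lines_overlap = (
        finding["line_start"] <= truth["line_end"]
        and truth["line_start"] <= finding["line_end"]
    )
    rungs = [
        (finding["filename"] == truth["filename"], 0.2),    # right file
        (finding["vuln_type"] == truth["vuln_type"], 0.3),  # right class
        (lines_overlap, 0.3),                               # right location
        (finding["severity"] == truth["severity"], 0.2),    # right severity
    ]
    credit = 0.0
    for matched, weight in rungs:
        if not matched:
            break
        credit += weight
    return credit


def final_score(true_positives, false_positives, total_vulns, duplicates=0,
                weights=(0.6, 0.4), duplicate_penalty=0.1):
    """Combine recall and precision, scaled by a quality multiplier."""
    submitted = true_positives + false_positives
    recall = true_positives / total_vulns if total_vulns else 0.0
    precision = true_positives / submitted if submitted else 0.0
    # Hypothetical rule: each duplicate finding shaves 10% off the multiplier.
    quality_multiplier = max(0.0, 1.0 - duplicate_penalty * duplicates)
    w_recall, w_precision = weights
    score = quality_multiplier * (w_recall * recall + w_precision * precision)
    return min(1.0, max(0.0, score))


# Two agents find the same 2 of 3 vulnerabilities; one also spams.
careful = final_score(true_positives=2, false_positives=0, total_vulns=3)
spammy = final_score(true_positives=2, false_positives=8, total_vulns=3, duplicates=3)
```

Under these assumed weights, `spammy` comes out lower than `careful` even though both agents have the same recall, which is the behavior the reward design aims for.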
| ## Baseline Scores | |
| With deterministic tasks and a simple tool-using model loop, expected baseline tendencies are: | |
| - easy: high recall, moderate precision | |
| - medium: moderate recall, moderate precision | |
| - hard: lower recall, stricter penalties for noisy findings | |
| Run inference.py to generate reproducible per-task scores for your selected model setup. | |
| ## Setup | |
| ### Option A: Run in-repo (OpenEnv monorepo) | |
| From repository root: | |
| ```bash | |
| docker build -t code-security-auditor-env:latest -f envs/code_security_auditor_env/server/Dockerfile . | |
| docker run -p 8000:8000 code-security-auditor-env:latest | |
| ``` | |
| ### Option B: Run standalone | |
| From this directory: | |
| ```bash | |
| docker build -t code-security-auditor-env:latest . | |
| docker run -p 8000:8000 code-security-auditor-env:latest | |
| ``` | |
| ## Baseline Inference | |
| The required script is inference.py, located in the project root (this directory). | |
| Required env vars: | |
| - API_BASE_URL | |
| - MODEL_NAME | |
| - HF_TOKEN | |
| Optional env vars: | |
| - LOCAL_IMAGE_NAME (for from_docker_image mode) | |
| - ENV_BASE_URL (for connecting to an already-running server) | |
| - TASK_IDS (comma-separated task ids, default: easy,medium,hard) | |
| - MAX_STEPS | |
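As an illustration of how these variables might be read, the sketch below collects them into a dictionary. `load_config` is a hypothetical helper and is not taken from inference.py; only the `TASK_IDS` default comes from the list above, and leaving `MAX_STEPS` as `None` when unset is an assumption because its default is not documented here.

```python
import os

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_config(env=None):
    """Read the documented env vars from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))

    # TASK_IDS is comma-separated; blank entries and stray spaces are dropped.
    raw_task_ids = env.get("TASK_IDS", "easy,medium,hard")
    task_ids = [item.strip() for item in raw_task_ids.split(",") if item.strip()]

    raw_max_steps = env.get("MAX_STEPS")
    return {
        "api_base_url": env["API_BASE_URL"],
        "model_name": env["MODEL_NAME"],
        "hf_token": env["HF_TOKEN"],
        "local_image_name": env.get("LOCAL_IMAGE_NAME"),
        "env_base_url": env.get("ENV_BASE_URL"),
        "task_ids": task_ids,
        # Assumption: no documented default, so None means "use the script's own".
        "max_steps": int(raw_max_steps) if raw_max_steps else None,
    }
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to check without touching the real environment.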
| Run: | |
| ```bash | |
| export HF_TOKEN=your_token | |
| export API_BASE_URL=https://router.huggingface.co/v1 | |
| export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct | |
| export LOCAL_IMAGE_NAME=code-security-auditor-env:latest | |
| python inference.py | |
| ``` | |
| The script prints only [START], [STEP], and [END] log lines per task. | |
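Because the output is limited to these three tags, a captured run can be sanity-checked by counting them. The sketch below assumes each log line begins with its bracketed tag; the rest of the line format is not specified here, so nothing beyond the tag is parsed. `summarize_log` is a hypothetical helper, not part of the project.

```python
from collections import Counter

LOG_TAGS = ("[START]", "[STEP]", "[END]")


def summarize_log(output):
    """Count tagged lines and collect any line that carries no known tag."""
    counts = Counter({tag: 0 for tag in LOG_TAGS})
    unexpected = []
    for line in output.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        for tag in LOG_TAGS:
            if stripped.startswith(tag):
                counts[tag] += 1
                break
        else:
            unexpected.append(stripped)
    return counts, unexpected
```

In a clean run one would expect equal `[START]` and `[END]` counts (one pair per task) and an empty `unexpected` list.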
| ## Hugging Face Spaces Deployment | |
| Space repository: | |
| - https://huggingface.co/spaces/Drac0528/CodeSecure | |
| Recommended deploy flow (git push to Space repo): | |
| ```bash | |
| git clone https://huggingface.co/spaces/Drac0528/CodeSecure | |
| cd CodeSecure | |
| cp -R /path/to/code_security_auditor_env/* . | |
| rm -f .env | |
| git add . | |
| git commit -m "Deploy Code Security Auditor OpenEnv" | |
| git push | |
| ``` | |
| Notes: | |
| - Keep README frontmatter and Dockerfile at Space repo root. | |
| - Use Space Settings to set runtime secrets/variables: | |
|   - HF_TOKEN (Secret) | |
|   - API_BASE_URL (Variable) | |
|   - MODEL_NAME (Variable) | |
| - Ensure Space tags include `openenv`. | |
| Verify API endpoint after build: | |
| ```bash | |
| curl -X POST https://drac0528-codesecure.hf.space/reset -H 'Content-Type: application/json' -d '{}' | |
| ``` | |
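The same check can be scripted with the Python standard library. This sketch mirrors the curl command above; `build_reset_request` and `post_reset` are hypothetical helper names, and treating the response body as JSON is an assumption, since the response shape is not documented here.

```python
import json
import urllib.request


def build_reset_request(base_url):
    """Build a POST request to <base_url>/reset with an empty JSON body."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/reset",
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_reset(base_url, timeout=30):
    """Send the request and return the decoded body (assumed to be JSON)."""
    request = build_reset_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `post_reset("https://drac0528-codesecure.hf.space")` after the build finishes should return the initial observation if the Space is up; a sleeping Space may need a moment to wake before it responds.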
| ## Validation | |
| Use validate-submission.sh before submitting: | |
| ```bash | |
| chmod +x validate-submission.sh | |
| ./validate-submission.sh https://drac0528-codesecure.hf.space . | |
| ``` | |