Drac0528 committed on Apr 8

Commit

43d8ac0

1 Parent(s): f0b9b7f

Upload README.md

README.md +200 -0

README.md ADDED

	@@ -0,0 +1,200 @@

+---
+title: Code Security Auditor Environment
+emoji: "🛡️"
+colorFrom: yellow
+colorTo: red
+sdk: docker
+pinned: false
+app_port: 8000
+base_path: /web
+tags:
+  - openenv
+  - security
+  - code-review
+  - reinforcement-learning
+---
+# Code Security Auditor Environment
+A real-world OpenEnv benchmark in which agents perform security audits on pull-request-style code snapshots.
+The agent inspects files, submits vulnerability findings, and finalizes a report. The environment scores each episode with deterministic graders against hidden vulnerability ground truth, awarding partial credit and penalizing reward hacking.
+## Why this is a real-world task
+Security reviewers and AppSec engineers routinely audit code for vulnerabilities before deployment. This environment models that workflow with concrete exploit classes:
+- SQL injection
+- command injection
+- insecure deserialization
+- weak authentication / auth bypass
+- SSRF
+- path traversal
+- hardcoded secrets
+## OpenEnv Compliance
+- Typed models: CodeSecurityAction, CodeSecurityObservation, CodeSecurityState
+- Core API: reset(), step(), state()
+- OpenEnv manifest: openenv.yaml
+- FastAPI runtime via server.app:app
+## Action Space
+Action model: CodeSecurityAction
+- action_type: inspect_file | submit_finding | submit_final_report
+- filename: target file to inspect or report against
+- line_start, line_end: suspected vulnerable range
+- vuln_type: one of supported vulnerability classes
+- severity: low | medium | high | critical
+- confidence: [0.0, 1.0]
+- evidence, summary: free-form context
+### Action semantics
+- inspect_file: returns full line-numbered file content.
+- submit_finding: grades the finding with deterministic partial credit.
+- submit_final_report: ends the episode and returns final score in [0.0, 1.0].
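The field list above can be pictured as a small typed record. The following is only a minimal sketch using standard-library dataclasses: `ActionSketch` and its `validate` method are hypothetical stand-ins, not the real `CodeSecurityAction`, and the rules about which fields each action type requires are assumptions rather than documented behavior.

```python
from dataclasses import dataclass
from typing import List, Optional

ACTION_TYPES = {"inspect_file", "submit_finding", "submit_final_report"}
SEVERITIES = {"low", "medium", "high", "critical"}


@dataclass
class ActionSketch:
    """Hypothetical stand-in mirroring the fields listed for CodeSecurityAction."""

    action_type: str
    filename: Optional[str] = None
    line_start: Optional[int] = None
    line_end: Optional[int] = None
    vuln_type: Optional[str] = None
    severity: Optional[str] = None
    confidence: Optional[float] = None
    evidence: str = ""
    summary: str = ""

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the action looks well-formed."""
        problems: List[str] = []
        if self.action_type not in ACTION_TYPES:
            problems.append(f"unknown action_type: {self.action_type!r}")
        # Assumption: inspecting and reporting both need a target file.
        if self.action_type in {"inspect_file", "submit_finding"} and not self.filename:
            problems.append("filename is required")
        if self.action_type == "submit_finding":
            if self.line_start is None or self.line_end is None:
                problems.append("line_start and line_end are required")
            elif self.line_start > self.line_end:
                problems.append("line_start must not exceed line_end")
            if self.severity not in SEVERITIES:
                problems.append(f"severity must be one of {sorted(SEVERITIES)}")
            if self.confidence is None or not 0.0 <= self.confidence <= 1.0:
                problems.append("confidence must be in [0.0, 1.0]")
        return problems


# "sql_injection" is an illustrative identifier; the real vuln_type strings may differ.
finding = ActionSketch(
    action_type="submit_finding",
    filename="app.py",
    line_start=10,
    line_end=12,
    vuln_type="sql_injection",
    severity="high",
    confidence=0.8,
)
print(finding.validate())
```

Checking an action locally before sending it avoids spending a step on a malformed submission; the environment's own validation remains authoritative.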
+## Observation Space
+Observation model: CodeSecurityObservation
+Key fields:
+- task_id, task_title, difficulty, objective
+- available_files
+- focused_file, file_excerpt
+- findings_so_far
+- steps_remaining
+- last_feedback
+- score_hint in [0, 1]
+- reward, done, metadata
+## Tasks and Difficulty
+The environment includes 3 deterministic tasks:
+1. easy: Legacy Flask Patch Review
+2. medium: Payment Webhook Service
+3. hard: Enterprise Multi-Tenant API
+Each task has:
+- realistic multi-file code snapshot
+- hidden vulnerability ground truth
+- deterministic grader with score in [0.0, 1.0]
+## Reward Design
+Reward shaping is trajectory-aware and resistant to reward hacking:
+- inspect_file gives a small positive signal for novel, relevant file exploration
+- submit_finding gives partial credit along a ladder (file -> type -> line -> severity -> confidence calibration)
+- duplicate/low-quality findings reduce quality_multiplier and final score
+- false positives and over-submission reduce precision and final score
+- final score combines weighted recall, precision, structural quality, and calibration
+Positive and negative signals are balanced: spamming findings adds steps but lowers precision and quality, so it cannot be used to inflate the score.
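To make the reward design concrete, here is a rough sketch of how a partial-credit ladder and a weighted final score could fit together. Everything numeric here is an assumption: the rung weights in `RUNGS`, the default weights in `final_score`, and the helper names `Truth`, `Finding`, and `ladder_credit` are hypothetical and are not taken from the environment's grader.

```python
from dataclasses import dataclass


@dataclass
class Truth:
    filename: str
    vuln_type: str
    line_start: int
    line_end: int
    severity: str


@dataclass
class Finding(Truth):
    confidence: float = 0.5


# Hypothetical rung weights (they sum to 1.0); the real grader may differ.
RUNGS = {"file": 0.2, "type": 0.3, "line": 0.25, "severity": 0.15, "calibration": 0.1}


def ladder_credit(finding: Finding, truth: Truth) -> float:
    """Climb file -> type -> line -> severity -> calibration, stopping at the first miss."""
    if finding.filename != truth.filename:
        return 0.0
    credit = RUNGS["file"]
    if finding.vuln_type != truth.vuln_type:
        return credit
    credit += RUNGS["type"]
    overlaps = finding.line_start <= truth.line_end and truth.line_start <= finding.line_end
    if not overlaps:
        return credit
    credit += RUNGS["line"]
    if finding.severity == truth.severity:
        credit += RUNGS["severity"]
    # Assumed calibration rule: higher confidence on a correctly located finding earns more.
    credit += RUNGS["calibration"] * finding.confidence
    return credit


def final_score(recall, precision, quality, calibration, weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted combination clamped to [0.0, 1.0]; the weights are illustrative only."""
    parts = (recall, precision, quality, calibration)
    score = sum(w * p for w, p in zip(weights, parts))
    return max(0.0, min(1.0, score))


# Same recall, but the noisy run has lower precision and quality, hence a lower score.
careful = final_score(recall=1.0, precision=1.0, quality=1.0, calibration=1.0)
noisy = final_score(recall=1.0, precision=0.2, quality=0.5, calibration=1.0)
print(careful > noisy)
```

Under these assumed weights, an agent that finds every vulnerability but buries them in false positives still scores below one that reports only the true findings, which is the property the reward design aims for.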
+## Baseline Scores
+With deterministic tasks and a simple tool-using model loop, expected baseline tendencies are:
+- easy: high recall, moderate precision
+- medium: moderate recall, moderate precision
+- hard: lower recall, stricter penalties for noisy findings
+Run inference.py to generate reproducible per-task scores for your selected model setup.
+## Setup
+### Option A: Run in-repo (OpenEnv monorepo)
+From repository root:
+```bash
+docker build -t code-security-auditor-env:latest -f envs/code_security_auditor_env/server/Dockerfile .
+docker run -p 8000:8000 code-security-auditor-env:latest
+```
+### Option B: Run standalone
+From this directory:
+```bash
+docker build -t code-security-auditor-env:latest .
+docker run -p 8000:8000 code-security-auditor-env:latest
+```
+## Baseline Inference
+The required script is inference.py in the project root (this directory).
+Required env vars:
+- API_BASE_URL
+- MODEL_NAME
+- HF_TOKEN
+Optional env vars:
+- LOCAL_IMAGE_NAME (for from_docker_image mode)
+- ENV_BASE_URL (for connecting to an already-running server)
+- TASK_IDS (comma-separated task ids, default: easy,medium,hard)
+- MAX_STEPS
+Run:
+```bash
+export HF_TOKEN=your_token
+export API_BASE_URL=https://router.huggingface.co/v1
+export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
+export LOCAL_IMAGE_NAME=code-security-auditor-env:latest
+python inference.py
+```
+The script prints only [START], [STEP], and [END] log lines per task.
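For orientation, the environment variables listed above could be read along these lines. This is a sketch, not the contents of inference.py: `InferenceConfig` and `load_config` are hypothetical names, and since no default is documented for MAX_STEPS it is left unset when absent.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional

REQUIRED = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


@dataclass
class InferenceConfig:
    api_base_url: str
    model_name: str
    hf_token: str
    task_ids: List[str]
    local_image_name: Optional[str] = None
    env_base_url: Optional[str] = None
    max_steps: Optional[int] = None


def load_config(env: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Read settings from a mapping (os.environ by default) and fail fast on gaps."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))
    # Documented default for TASK_IDS is easy,medium,hard.
    raw_ids = env.get("TASK_IDS", "easy,medium,hard")
    task_ids = [t.strip() for t in raw_ids.split(",") if t.strip()]
    raw_steps = env.get("MAX_STEPS")
    return InferenceConfig(
        api_base_url=env["API_BASE_URL"],
        model_name=env["MODEL_NAME"],
        hf_token=env["HF_TOKEN"],
        task_ids=task_ids,
        local_image_name=env.get("LOCAL_IMAGE_NAME") or None,
        env_base_url=env.get("ENV_BASE_URL") or None,
        max_steps=int(raw_steps) if raw_steps else None,
    )


# Example with an explicit mapping instead of the real process environment.
cfg = load_config(
    {
        "API_BASE_URL": "https://router.huggingface.co/v1",
        "MODEL_NAME": "Qwen/Qwen2.5-72B-Instruct",
        "HF_TOKEN": "your_token",
        "TASK_IDS": "easy, hard",
    }
)
print(cfg.task_ids)
```

Passing the mapping as a parameter keeps the parsing testable without touching the real process environment.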
+## Hugging Face Spaces Deployment
+Space repository:
+- https://huggingface.co/spaces/Drac0528/CodeSecure
+Recommended deploy flow (git push to Space repo):
+```bash
+git clone https://huggingface.co/spaces/Drac0528/CodeSecure
+cd CodeSecure
+cp -R /path/to/code_security_auditor_env/* .
+rm -f .env
+git add .
+git commit -m "Deploy Code Security Auditor OpenEnv"
+git push
+```
+Notes:
+- Keep README frontmatter and Dockerfile at Space repo root.
+- Use Space Settings to set runtime secrets/variables:
+  - HF_TOKEN (Secret)
+  - API_BASE_URL (Variable)
+  - MODEL_NAME (Variable)
+- Ensure Space tags include `openenv`.
+Verify API endpoint after build:
+```bash
+curl -X POST https://drac0528-codesecure.hf.space/reset -H 'Content-Type: application/json' -d '{}'
+```
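The same check can be scripted in Python with only the standard library. This sketch mirrors the curl call above (POST to /reset with an empty JSON body); the helper names `build_reset_request` and `check_reset` are hypothetical, and nothing is assumed about the response body beyond the HTTP status.

```python
import json
import urllib.request


def build_reset_request(base_url: str) -> urllib.request.Request:
    """Build the POST /reset request with an empty JSON object as the body."""
    url = base_url.rstrip("/") + "/reset"
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_reset(base_url: str, timeout: float = 30.0) -> int:
    """Send the request and return the HTTP status; urlopen raises on 4xx/5xx."""
    with urllib.request.urlopen(build_reset_request(base_url), timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # A sleeping Space may need time to wake up before it responds.
    print(check_reset("https://drac0528-codesecure.hf.space"))
```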
+## Validation
+Use validate-submission.sh before submitting:
+```bash
+chmod +x validate-submission.sh
+./validate-submission.sh https://drac0528-codesecure.hf.space .
+```