	---
	title: Code Security Auditor Environment
	emoji: "🛡️"
	colorFrom: yellow
	colorTo: red
	sdk: docker
	pinned: false
	app_port: 8000
	base_path: /web
	tags:
	- openenv
	- security
	- code-review
	- reinforcement-learning
	---

	# Code Security Auditor Environment

	A real-world OpenEnv benchmark in which agents audit pull-request-style code snapshots for security vulnerabilities.

	The agent inspects files, submits vulnerability findings, and finalizes a report. Deterministic graders score each episode against hidden vulnerability ground truth, awarding partial credit and applying penalties that discourage reward hacking.

	## Why this is a real-world task

	Security reviewers and AppSec engineers routinely audit code for vulnerabilities before deployment. This environment models that workflow with concrete exploit classes:

	- SQL injection
	- command injection
	- insecure deserialization
	- weak authentication / auth bypass
	- SSRF
	- path traversal
	- hardcoded secrets

	## OpenEnv Compliance

	- Typed models: CodeSecurityAction, CodeSecurityObservation, CodeSecurityState
	- Core API: reset(), step(), state()
	- OpenEnv manifest: openenv.yaml
	- FastAPI runtime via server.app:app

	## Action Space

	Action model: CodeSecurityAction

	- action_type: inspect_file \| submit_finding \| submit_final_report
	- filename: target file to inspect or report against
	- line_start, line_end: suspected vulnerable range
	- vuln_type: one of supported vulnerability classes
	- severity: low \| medium \| high \| critical
	- confidence: [0.0, 1.0]
	- evidence, summary: free-form context
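As a rough illustration of the fields above, here is a minimal sketch using a plain dataclass. The real CodeSecurityAction is a typed OpenEnv model; the class name `ActionSketch`, the field types, the defaults, and the validation rules shown here are assumptions, as are the example file name and the `sql_injection` spelling.

```python
from dataclasses import dataclass
from typing import Optional

ACTION_TYPES = {"inspect_file", "submit_finding", "submit_final_report"}
SEVERITIES = {"low", "medium", "high", "critical"}


@dataclass
class ActionSketch:
    """Hypothetical stand-in for CodeSecurityAction (types and defaults are assumed)."""

    action_type: str
    filename: Optional[str] = None
    line_start: Optional[int] = None
    line_end: Optional[int] = None
    vuln_type: Optional[str] = None
    severity: Optional[str] = None
    confidence: Optional[float] = None
    evidence: str = ""
    summary: str = ""

    def validate(self) -> None:
        """Raise ValueError if the action breaks the constraints listed above."""
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        if self.severity is not None and self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0.0, 1.0]")
        if (
            self.line_start is not None
            and self.line_end is not None
            and self.line_start > self.line_end
        ):
            raise ValueError("line_start must not exceed line_end")


finding = ActionSketch(
    action_type="submit_finding",
    filename="app.py",  # hypothetical file name
    line_start=10,
    line_end=12,
    vuln_type="sql_injection",  # assumed spelling of the vulnerability class
    severity="high",
    confidence=0.8,
    evidence="User input is concatenated into a SQL string.",
)
finding.validate()  # passes silently when the action is well-formed
```

Validating on the client side like this is optional; it only catches malformed actions before they cost a step.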

	### Action semantics

	- inspect_file: returns full line-numbered file content.
	- submit_finding: grades the finding with deterministic partial credit.
	- submit_final_report: ends the episode and returns final score in [0.0, 1.0].

	## Observation Space

	Observation model: CodeSecurityObservation

	Key fields:

	- task_id, task_title, difficulty, objective
	- available_files
	- focused_file, file_excerpt
	- findings_so_far
	- steps_remaining
	- last_feedback
	- score_hint in [0.0, 1.0]
	- reward, done, metadata
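To make the reset/step flow concrete, the sketch below runs one episode against a toy in-memory stand-in. `ToyEnv`, its file contents, its step budget, and its reward values are all hypothetical and exist only to make the loop runnable; observations are plain dicts carrying a few of the fields listed above, not the real CodeSecurityObservation.

```python
class ToyEnv:
    """Hypothetical stand-in for the real environment (not its actual API or scoring)."""

    def __init__(self):
        self.files = {"app.py": "1: query = 'SELECT * FROM t WHERE id=' + user_id"}
        self.steps_remaining = 0
        self.findings = []

    def _obs(self, reward=0.0, done=False, feedback="", excerpt=None, focused=None):
        return {
            "available_files": sorted(self.files),
            "focused_file": focused,
            "file_excerpt": excerpt,
            "findings_so_far": list(self.findings),
            "steps_remaining": self.steps_remaining,
            "last_feedback": feedback,
            "reward": reward,
            "done": done,
        }

    def reset(self):
        self.steps_remaining = 5  # made-up step budget
        self.findings = []
        return self._obs(feedback="episode started")

    def step(self, action):
        self.steps_remaining -= 1
        kind = action["action_type"]
        if kind == "inspect_file":
            name = action["filename"]
            obs = self._obs(reward=0.01, excerpt=self.files[name], focused=name)
        elif kind == "submit_finding":
            self.findings.append(action)
            obs = self._obs(reward=0.1, feedback="finding recorded")
        else:  # submit_final_report ends the episode
            obs = self._obs(reward=0.5, done=True, feedback="report finalized")
        if self.steps_remaining <= 0:
            obs["done"] = True  # out of steps, so the episode ends anyway
        return obs


def run_episode(env):
    """Inspect every file, submit one finding per file, then finalize."""
    obs = env.reset()
    plan = []
    for name in obs["available_files"]:
        plan.append({"action_type": "inspect_file", "filename": name})
        plan.append({"action_type": "submit_finding", "filename": name,
                     "line_start": 1, "line_end": 1})
    plan.append({"action_type": "submit_final_report"})

    total = 0.0
    for action in plan:
        obs = env.step(action)
        total += obs["reward"]
        if obs["done"]:
            break
    return total, obs


total_reward, last_obs = run_episode(ToyEnv())
```

A real agent would choose each action from the latest observation (for example `file_excerpt` and `last_feedback`) rather than follow a fixed plan.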

	## Tasks and Difficulty

	The environment includes 3 deterministic tasks:

	1. easy: Legacy Flask Patch Review
	2. medium: Payment Webhook Service
	3. hard: Enterprise Multi-Tenant API

	Each task has:

	- realistic multi-file code snapshot
	- hidden vulnerability ground truth
	- deterministic grader with score in [0.0, 1.0]

	## Reward Design

	Reward shaping is trajectory-aware and resistant to reward hacking:

	- inspect_file gives a small positive signal for novel, relevant file exploration
	- submit_finding awards partial credit along a ladder (file -> type -> line -> severity -> confidence calibration)
	- duplicate/low-quality findings reduce quality_multiplier and final score
	- false positives and over-submission reduce precision and final score
	- final score combines weighted recall, precision, structural quality, and calibration
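The partial-credit ladder can be pictured with a small sketch in which each rung counts only if the earlier rungs matched. The function name `ladder_credit`, the rung weights, the overlap rule for line ranges, and the example ground truth are illustrative assumptions, not the environment's actual grader.

```python
def ladder_credit(finding, truth):
    """Return partial credit in [0.0, 1.0]; each rung counts only if earlier ones matched.

    The rung weights below are illustrative assumptions, not the environment's values.
    """
    credit = 0.0
    if finding["filename"] != truth["filename"]:
        return credit
    credit += 0.2  # right file
    if finding["vuln_type"] != truth["vuln_type"]:
        return credit
    credit += 0.3  # right vulnerability class
    overlaps = (finding["line_start"] <= truth["line_end"]
                and truth["line_start"] <= finding["line_end"])
    if not overlaps:
        return credit
    credit += 0.2  # reported range overlaps the true range
    if finding["severity"] != truth["severity"]:
        return credit
    credit += 0.2  # right severity
    # Calibration rung: a fully matching finding earns more when confidence is high.
    credit += 0.1 * finding["confidence"]
    return min(1.0, credit)


# Hypothetical ground-truth entry and two candidate findings.
truth = {"filename": "db.py", "vuln_type": "sql_injection",
         "line_start": 40, "line_end": 44, "severity": "high"}
exact = {"filename": "db.py", "vuln_type": "sql_injection",
         "line_start": 41, "line_end": 42, "severity": "high", "confidence": 1.0}
wrong_type = dict(exact, vuln_type="path_traversal")

exact_credit = ladder_credit(exact, truth)
wrong_type_credit = ladder_credit(wrong_type, truth)
```

In this sketch a finding with the right file but the wrong class stops at the first rung, so it earns far less than a fully matching one.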

	This makes over-submission self-defeating: spamming findings uses up steps and lowers both precision and quality, so it cannot be exploited for easy reward.
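A toy calculation shows why noisy submissions do not pay off. The function `final_score`, its weights, and the additive form are assumptions made for illustration; the real grader may combine recall, precision, quality, and calibration differently (for example by applying quality as a multiplier).

```python
def final_score(true_positives, total_vulns, submitted, quality=1.0, calibration=1.0):
    """Toy final score in [0.0, 1.0]; the weights are assumptions, not the real grader's."""
    recall = true_positives / total_vulns if total_vulns else 0.0
    precision = true_positives / submitted if submitted else 0.0
    score = 0.5 * recall + 0.3 * precision + 0.1 * quality + 0.1 * calibration
    return max(0.0, min(1.0, score))


# A careful agent: 4 findings submitted, 3 of the 4 real vulnerabilities found.
careful = final_score(true_positives=3, total_vulns=4, submitted=4)

# A spamming agent: finds all 4, but buried in 40 low-quality submissions.
spammy = final_score(true_positives=4, total_vulns=4, submitted=40, quality=0.5)

assert careful > spammy  # higher recall does not make up for the lost precision
```

Under these assumed weights the spamming agent reaches perfect recall yet still scores lower, because precision falls to 0.1 and the quality term is halved.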

	## Baseline Scores

	With deterministic tasks and a simple tool-using model loop, expected baseline tendencies are:

	- easy: high recall, moderate precision
	- medium: moderate recall, moderate precision
	- hard: lower recall, stricter penalties for noisy findings

	Run inference.py to generate reproducible per-task scores for your selected model setup.

	## Setup

	### Option A: Run in-repo (OpenEnv monorepo)

	From repository root:

	```bash
	docker build -t code-security-auditor-env:latest -f envs/code_security_auditor_env/server/Dockerfile .
	docker run -p 8000:8000 code-security-auditor-env:latest
	```

	### Option B: Run standalone

	From this directory:

	```bash
	docker build -t code-security-auditor-env:latest .
	docker run -p 8000:8000 code-security-auditor-env:latest
	```

	## Baseline Inference

	The required script is inference.py, located in the project root (this directory).

	Required env vars:

	- API_BASE_URL
	- MODEL_NAME
	- HF_TOKEN

	Optional env vars:

	- LOCAL_IMAGE_NAME (for from_docker_image mode)
	- ENV_BASE_URL (for connecting to an already-running server)
	- TASK_IDS (comma-separated task ids, default: easy,medium,hard)
	- MAX_STEPS
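One way a script could read these variables is sketched below. `InferenceConfig` and `load_config` are hypothetical names and are not taken from inference.py; only the variable names and the TASK_IDS default come from the lists above, and MAX_STEPS is left unset when absent because its default is not documented here.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass
class InferenceConfig:
    """Hypothetical container for the settings listed above."""

    api_base_url: str
    model_name: str
    hf_token: str
    task_ids: List[str]
    local_image_name: Optional[str] = None
    env_base_url: Optional[str] = None
    max_steps: Optional[int] = None


def load_config(env: Mapping[str, str] = os.environ) -> InferenceConfig:
    """Read the variables listed above; raise if a required one is missing."""
    required = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))

    # TASK_IDS is comma-separated; empty entries and stray spaces are dropped.
    raw_tasks = env.get("TASK_IDS", "easy,medium,hard")
    task_ids = [task.strip() for task in raw_tasks.split(",") if task.strip()]

    max_steps = env.get("MAX_STEPS")
    return InferenceConfig(
        api_base_url=env["API_BASE_URL"],
        model_name=env["MODEL_NAME"],
        hf_token=env["HF_TOKEN"],
        task_ids=task_ids,
        local_image_name=env.get("LOCAL_IMAGE_NAME") or None,
        env_base_url=env.get("ENV_BASE_URL") or None,
        max_steps=int(max_steps) if max_steps else None,
    )


# Example with an explicit mapping instead of the real process environment.
config = load_config({
    "API_BASE_URL": "https://router.huggingface.co/v1",
    "MODEL_NAME": "Qwen/Qwen2.5-72B-Instruct",
    "HF_TOKEN": "your_token",
    "TASK_IDS": "easy, hard",
})
```

Failing fast on missing required variables avoids starting an episode that would only error out at the first model call.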

	Run:

	```bash
	export HF_TOKEN=your_token
	export API_BASE_URL=https://router.huggingface.co/v1
	export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
	export LOCAL_IMAGE_NAME=code-security-auditor-env:latest
	python inference.py
	```

	The script prints only [START], [STEP], and [END] log lines per task.

	## Hugging Face Spaces Deployment

	Space repository:

	- https://huggingface.co/spaces/Drac0528/CodeSecure

	Recommended deploy flow (git push to Space repo):

	```bash
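	git clone https://huggingface.co/spaces/Drac0528/CodeSecure
	cd CodeSecure
	# The * glob skips dotfiles; copy any needed ones (e.g. .dockerignore) separately.
	cp -R /path/to/code_security_auditor_env/* .
	# Make sure no local secrets file ends up in the commit.
	rm -f .env
	git add .
	git commit -m "Deploy Code Security Auditor OpenEnv"
	git push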
	```

	Notes:

	- Keep README frontmatter and Dockerfile at Space repo root.
	- Use Space Settings to set runtime secrets/variables:
	  - HF_TOKEN (Secret)
	  - API_BASE_URL (Variable)
	  - MODEL_NAME (Variable)
	- Ensure Space tags include `openenv`.

	Verify API endpoint after build:

	```bash
	curl -X POST https://drac0528-codesecure.hf.space/reset -H 'Content-Type: application/json' -d '{}'
	```
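The same check can be scripted in Python with only the standard library. This is a minimal sketch: the helper names `build_reset_request` and `reset_space` are hypothetical, and nothing is assumed about the shape of the JSON that `/reset` returns.

```python
import json
import urllib.request
from typing import Any


def build_reset_request(base_url: str) -> urllib.request.Request:
    """Build the same POST /reset call as the curl command, without sending it."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/reset",
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_space(base_url: str, timeout: float = 30.0) -> Any:
    """Send the request and return the decoded JSON body, whatever its schema is."""
    request = build_reset_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request has no side effects; call reset_space(...) to actually send it.
request = build_reset_request("https://drac0528-codesecure.hf.space")
```

A sleeping Space may need some time to wake up, so a generous timeout or a retry around `reset_space` can help.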

	## Validation

	Use validate-submission.sh before submitting:

	```bash
	chmod +x validate-submission.sh
	./validate-submission.sh https://drac0528-codesecure.hf.space .
	```