Spaces: Drac0528/CodeSecure

Drac0528 committed on Apr 8

Commit f0b9b7f (verified) · 1 Parent(s): f4e93c5

Delete README.md

Files changed (1)

README.md +0 -184

README.md DELETED

@@ -1,184 +0,0 @@
-# Code Security Auditor Environment
-A real-world OpenEnv benchmark where agents perform security auditing on pull-request style code snapshots.
-The agent inspects files, submits vulnerability findings, and finalizes a report. Deterministic graders score each submission against hidden vulnerability ground truth, awarding partial credit and applying penalties that discourage reward hacking.
-## Why this is a real-world task
-Security reviewers and AppSec engineers routinely audit code for vulnerabilities before deployment. This environment models that workflow with concrete exploit classes:
-- SQL injection
-- command injection
-- insecure deserialization
-- weak authentication / auth bypass
-- SSRF
-- path traversal
-- hardcoded secrets
-## OpenEnv Compliance
-- Typed models: CodeSecurityAction, CodeSecurityObservation, CodeSecurityState
-- Core API: reset(), step(), state()
-- OpenEnv manifest: openenv.yaml
-- FastAPI runtime via server.app:app
-## Action Space
-Action model: CodeSecurityAction
-- action_type: inspect_file | submit_finding | submit_final_report
-- filename: target file to inspect or report against
-- line_start, line_end: suspected vulnerable range
-- vuln_type: one of supported vulnerability classes
-- severity: low | medium | high | critical
-- confidence: [0.0, 1.0]
-- evidence, summary: free-form context
-### Action semantics
-- inspect_file: returns full line-numbered file content.
-- submit_finding: grades the finding with deterministic partial credit.
-- submit_final_report: ends the episode and returns final score in [0.0, 1.0].
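
The action fields listed above could be modeled roughly as follows. This is a minimal stdlib sketch, not the repository's actual `CodeSecurityAction` definition: which fields are optional, the validation rules, and the example values (`app.py`, `sql_injection`) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values as listed in the README.
ACTION_TYPES = {"inspect_file", "submit_finding", "submit_final_report"}
SEVERITIES = {"low", "medium", "high", "critical"}


@dataclass
class CodeSecurityAction:
    """Illustrative stand-in for the environment's typed action model."""

    action_type: str
    filename: Optional[str] = None
    line_start: Optional[int] = None
    line_end: Optional[int] = None
    vuln_type: Optional[str] = None
    severity: Optional[str] = None
    confidence: Optional[float] = None
    evidence: str = ""
    summary: str = ""

    def __post_init__(self) -> None:
        # Assumed validation rules; the real model may differ.
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        if self.action_type in {"inspect_file", "submit_finding"} and not self.filename:
            raise ValueError("filename is required for this action_type")
        if self.severity is not None and self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.confidence is not None and not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0.0, 1.0]")
        if self.line_start is not None and self.line_end is not None:
            if self.line_start > self.line_end:
                raise ValueError("line_start must not exceed line_end")


# Hypothetical finding; the file name and vuln_type spelling are placeholders.
finding = CodeSecurityAction(
    action_type="submit_finding",
    filename="app.py",
    line_start=10,
    line_end=12,
    vuln_type="sql_injection",
    severity="high",
    confidence=0.8,
)
```

Rejecting malformed actions at construction time keeps the grader from having to handle out-of-range confidence values or inverted line ranges.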
-## Observation Space
-Observation model: CodeSecurityObservation
-Key fields:
-- task_id, task_title, difficulty, objective
-- available_files
-- focused_file, file_excerpt
-- findings_so_far
-- steps_remaining
-- last_feedback
-- score_hint in [0, 1]
-- reward, done, metadata
-## Tasks and Difficulty
-The environment includes 3 deterministic tasks:
-1. easy: Legacy Flask Patch Review
-2. medium: Payment Webhook Service
-3. hard: Enterprise Multi-Tenant API
-Each task has:
-- realistic multi-file code snapshot
-- hidden vulnerability ground truth
-- deterministic grader with score in [0.0, 1.0]
-## Reward Design
-Reward shaping is trajectory-aware and resistant to reward hacking:
-- inspect_file gives a small positive signal for novel, relevant file exploration
-- submit_finding gives partial credit along a ladder (file -> type -> line -> severity -> confidence calibration)
-- duplicate/low-quality findings reduce quality_multiplier and final score
-- false positives and over-submission reduce precision and final score
-- final score combines weighted recall, precision, structural quality, and calibration
-This keeps the incentives balanced: spamming findings adds steps but lowers precision and quality, which prevents easy reward exploitation.
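
The reward design above can be illustrated with a small sketch. The real grader is not shown in the README, so everything here is an assumption: the equal rung weights, the line-overlap rule, the weighted-sum form of the final score, and the weights `(0.4, 0.3, 0.2, 0.1)` are hypothetical choices that only demonstrate the shape of a partial-credit ladder and why over-submission lowers the score.

```python
def ladder_credit(finding: dict, truth: dict) -> float:
    """Partial credit that climbs file -> type -> line -> severity -> calibration.

    Credit stops accumulating at the first rung that fails (assumed behavior).
    """
    checks = [
        finding["filename"] == truth["filename"],
        finding["vuln_type"] == truth["vuln_type"],
        # Line rung: the reported range must overlap the true range.
        finding["line_start"] <= truth["line_end"]
        and truth["line_start"] <= finding["line_end"],
        finding["severity"] == truth["severity"],
    ]
    rungs = 0
    for passed in checks:
        if not passed:
            break
        rungs += 1
    credit = rungs / 5
    if rungs == len(checks):
        # Calibration rung: a fully correct finding earns more when confident.
        credit += finding["confidence"] / 5
    return credit


def final_score(recall, precision, quality, calibration,
                weights=(0.4, 0.3, 0.2, 0.1)):
    """Weighted combination clamped to [0.0, 1.0]; the weights are hypothetical."""
    parts = (recall, precision, quality, calibration)
    score = sum(w * p for w, p in zip(weights, parts))
    return max(0.0, min(1.0, score))


# Two true positives found either way; the second run pads them with 8 false positives.
focused = final_score(recall=1.0, precision=2 / 2, quality=1.0, calibration=1.0)
spammy = final_score(recall=1.0, precision=2 / 10, quality=0.5, calibration=1.0)
```

Under these assumed weights the padded run scores lower than the focused one even though recall is identical, which is the behavior the README describes.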
-## Baseline Scores
-With deterministic tasks and a simple tool-using model loop, expected baseline tendencies are:
-- easy: high recall, moderate precision
-- medium: moderate recall, moderate precision
-- hard: lower recall, stricter penalties for noisy findings
-Run inference.py to generate reproducible per-task scores for your selected model setup.
-## Setup
-### Option A: Run in-repo (OpenEnv monorepo)
-From repository root:
-```bash
-docker build -t code-security-auditor-env:latest -f envs/code_security_auditor_env/server/Dockerfile .
-docker run -p 8000:8000 code-security-auditor-env:latest
-```
-### Option B: Run standalone
-From this directory:
-```bash
-docker build -t code-security-auditor-env:latest .
-docker run -p 8000:8000 code-security-auditor-env:latest
-```
-## Baseline Inference
-The required script is inference.py in the project root (this directory).
-Required env vars:
-- API_BASE_URL
-- MODEL_NAME
-- HF_TOKEN
-Optional env vars:
-- LOCAL_IMAGE_NAME (for from_docker_image mode)
-- ENV_BASE_URL (for connecting to an already-running server)
-- TASK_IDS (comma-separated task ids, default: easy,medium,hard)
-- MAX_STEPS
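
A script consuming the variables listed above might read them along these lines. This is a sketch only: how inference.py actually parses its environment is not shown in the README, and the helper names `task_ids_from_env` and `missing_required` are hypothetical.

```python
import os

REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")
DEFAULT_TASK_IDS = ("easy", "medium", "hard")


def task_ids_from_env(environ=None):
    """Parse comma-separated TASK_IDS, falling back to the documented default."""
    environ = os.environ if environ is None else environ
    raw = environ.get("TASK_IDS", "")
    ids = [part.strip() for part in raw.split(",") if part.strip()]
    return ids or list(DEFAULT_TASK_IDS)


def missing_required(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

Passing a plain dict instead of `os.environ` makes both helpers easy to check without touching the real process environment.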
-Run:
-```bash
-export HF_TOKEN=your_token
-export API_BASE_URL=https://router.huggingface.co/v1
-export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
-export LOCAL_IMAGE_NAME=code-security-auditor-env:latest
-python inference.py
-```
-The script prints only [START], [STEP], and [END] log lines per task.
-## Hugging Face Spaces Deployment
-Space repository:
-- https://huggingface.co/spaces/Drac0528/CodeSecure
-Recommended deploy flow (git push to Space repo):
-```bash
-git clone https://huggingface.co/spaces/Drac0528/CodeSecure
-cd CodeSecure
-cp -R /path/to/code_security_auditor_env/* .
-rm -f .env
-git add .
-git commit -m "Deploy Code Security Auditor OpenEnv"
-git push
-```
-Notes:
-- Keep README frontmatter and Dockerfile at Space repo root.
-- Use Space Settings to set runtime secrets/variables:
-  - HF_TOKEN (Secret)
-  - API_BASE_URL (Variable)
-  - MODEL_NAME (Variable)
-- Ensure Space tags include `openenv`.
-Verify API endpoint after build:
-```bash
-curl -X POST https://drac0528-codesecure.hf.space/reset -H 'Content-Type: application/json' -d '{}'
-```
-## Validation
-Use validate-submission.sh before submitting:
-```bash
-chmod +x validate-submission.sh
-./validate-submission.sh https://drac0528-codesecure.hf.space .
-```