Drac0528 committed on
Commit f0b9b7f · verified
1 Parent(s): f4e93c5

Delete README.md

Files changed (1)
  1. README.md +0 -184
README.md DELETED
@@ -1,184 +0,0 @@
- # Code Security Auditor Environment
-
- A real-world OpenEnv benchmark where agents perform security auditing on pull-request-style code snapshots.
-
- The agent inspects files, submits vulnerability findings, and finalizes a report. The environment scores submissions with deterministic graders against the true vulnerability ground truth, with partial credit and anti-reward-hacking penalties.
-
- ## Why this is a real-world task
-
- Security reviewers and AppSec engineers routinely audit code for vulnerabilities before deployment. This environment models that workflow with concrete exploit classes:
-
- - SQL injection
- - command injection
- - insecure deserialization
- - weak authentication / auth bypass
- - SSRF
- - path traversal
- - hardcoded secrets
-
- ## OpenEnv Compliance
-
- - Typed models: CodeSecurityAction, CodeSecurityObservation, CodeSecurityState
- - Core API: reset(), step(), state()
- - OpenEnv manifest: openenv.yaml
- - FastAPI runtime via server.app:app
-
- ## Action Space
-
- Action model: CodeSecurityAction
-
- - action_type: inspect_file | submit_finding | submit_final_report
- - filename: target file to inspect or report against
- - line_start, line_end: suspected vulnerable range
- - vuln_type: one of supported vulnerability classes
- - severity: low | medium | high | critical
- - confidence: [0.0, 1.0]
- - evidence, summary: free-form context
-
- ### Action semantics
-
- - inspect_file: returns full line-numbered file content.
- - submit_finding: grades the finding with deterministic partial credit.
- - submit_final_report: ends the episode and returns final score in [0.0, 1.0].
-
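For readers of this diff, the action space described in the removed section can be sketched roughly as below. This is a minimal illustration using a standard-library dataclass, not the actual `CodeSecurityAction` class: the class name `AuditAction`, the validation rules, the file name `app.py`, and the `sql_injection` identifier are all assumptions. Only the field names and allowed values come from the README text.

```python
from dataclasses import dataclass
from typing import Optional

ACTION_TYPES = {"inspect_file", "submit_finding", "submit_final_report"}
SEVERITIES = {"low", "medium", "high", "critical"}


@dataclass
class AuditAction:
    """Hypothetical stand-in for CodeSecurityAction (field names from the README)."""

    action_type: str
    filename: Optional[str] = None
    line_start: Optional[int] = None
    line_end: Optional[int] = None
    vuln_type: Optional[str] = None
    severity: Optional[str] = None
    confidence: Optional[float] = None
    evidence: str = ""
    summary: str = ""

    def validate(self) -> None:
        """Raise ValueError if the action is malformed (assumed rules)."""
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        if self.action_type == "inspect_file" and not self.filename:
            raise ValueError("inspect_file needs a filename")
        if self.action_type == "submit_finding":
            if self.severity not in SEVERITIES:
                raise ValueError(f"unknown severity: {self.severity!r}")
            if self.confidence is None or not 0.0 <= self.confidence <= 1.0:
                raise ValueError("confidence must be in [0.0, 1.0]")
            if (
                self.line_start is None
                or self.line_end is None
                or self.line_start > self.line_end
            ):
                raise ValueError("line_start must not exceed line_end")


# A plausible three-step episode: inspect, report one finding, finalize.
episode = [
    AuditAction(action_type="inspect_file", filename="app.py"),
    AuditAction(
        action_type="submit_finding",
        filename="app.py",
        line_start=12,
        line_end=14,
        vuln_type="sql_injection",  # hypothetical identifier
        severity="high",
        confidence=0.8,
        evidence="query built by string concatenation",
    ),
    AuditAction(action_type="submit_final_report", summary="one finding"),
]
for action in episode:
    action.validate()
```

The real model is likely stricter or looser in places; the point is only that each action type uses a different subset of the listed fields.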
- ## Observation Space
-
- Observation model: CodeSecurityObservation
-
- Key fields:
-
- - task_id, task_title, difficulty, objective
- - available_files
- - focused_file, file_excerpt
- - findings_so_far
- - steps_remaining
- - last_feedback
- - score_hint in [0, 1]
- - reward, done, metadata
-
- ## Tasks and Difficulty
-
- The environment includes 3 deterministic tasks:
-
- 1. easy: Legacy Flask Patch Review
- 2. medium: Payment Webhook Service
- 3. hard: Enterprise Multi-Tenant API
-
- Each task has:
-
- - realistic multi-file code snapshot
- - hidden vulnerability ground truth
- - deterministic grader with score in [0.0, 1.0]
-
- ## Reward Design
-
- Reward shaping is trajectory-aware and resistant to reward hacking:
-
- - inspect_file gives small positive signal for novel, relevant file exploration
- - submit_finding gives partial credit ladder (file -> type -> line -> severity -> confidence calibration)
- - duplicate/low-quality findings reduce quality_multiplier and final score
- - false positives and over-submission reduce precision and final score
- - final score combines weighted recall, precision, structural quality, and calibration
-
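The partial credit ladder named in the list above (file -> type -> line -> severity -> confidence calibration) could work along the following lines. This is a sketch under stated assumptions: the rung weights, the overlap test for line ranges, and the way confidence scales the last rung are invented for illustration, and `Vuln` and `ladder_credit` are hypothetical names, not the grader that was shipped with this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vuln:
    """Hypothetical record for both a finding and a ground-truth entry."""

    filename: str
    vuln_type: str
    line_start: int
    line_end: int
    severity: str


# Assumed weights; the real grader's values are not given in the README.
LADDER = (("file", 0.20), ("type", 0.25), ("line", 0.25), ("severity", 0.15))
CALIBRATION_WEIGHT = 0.15


def ladder_credit(finding: Vuln, truth: Vuln, confidence: float) -> float:
    """Award credit rung by rung, stopping at the first rung that fails."""
    checks = {
        "file": finding.filename == truth.filename,
        "type": finding.vuln_type == truth.vuln_type,
        # Assumed rule: any overlap between the two line ranges counts.
        "line": finding.line_start <= truth.line_end
        and truth.line_start <= finding.line_end,
        "severity": finding.severity == truth.severity,
    }
    credit = 0.0
    for rung, weight in LADDER:
        if not checks[rung]:
            return credit
        credit += weight
    # Only a fully correct finding earns calibration credit, scaled by
    # how confident the agent said it was.
    return credit + CALIBRATION_WEIGHT * confidence


truth = Vuln("app.py", "sql_injection", 12, 14, "high")
exact = Vuln("app.py", "sql_injection", 12, 14, "high")
wrong_lines = Vuln("app.py", "sql_injection", 40, 42, "high")
wrong_file = Vuln("db.py", "sql_injection", 12, 14, "high")
```

Because the ladder is ordered, a finding in the wrong file earns nothing even if its type and severity happen to match, while a finding in the right file with the right type keeps the first two rungs even when the line range is off.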
- This keeps the incentives balanced: spamming findings uses more steps but lowers precision and quality, so it cannot be used to inflate the reward.
-
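To make the anti-spam argument concrete, here is one way the final score might combine its four components. The weights, the multiplicative use of `quality_multiplier`, and the example numbers are assumptions chosen for illustration; the README names the components but does not give the formula.

```python
# Assumed weights for recall, precision, structural quality, calibration.
WEIGHTS = (0.4, 0.3, 0.2, 0.1)


def final_score(
    true_positives: int,
    submitted: int,
    total_vulns: int,
    structure: float,
    calibration: float,
    quality_multiplier: float = 1.0,
) -> float:
    """Blend the four components, then apply the quality multiplier."""
    recall = true_positives / total_vulns if total_vulns else 0.0
    precision = true_positives / submitted if submitted else 0.0
    w_recall, w_precision, w_structure, w_calibration = WEIGHTS
    blended = (
        w_recall * recall
        + w_precision * precision
        + w_structure * structure
        + w_calibration * calibration
    )
    # Clamp so the result stays in [0.0, 1.0] as the README requires.
    return max(0.0, min(1.0, blended * quality_multiplier))


# A careful agent: three correct findings out of four real issues, no noise.
careful = final_score(
    true_positives=3, submitted=3, total_vulns=4, structure=1.0, calibration=0.8
)

# A spamming agent: finds all four issues but buries them in 20 submissions,
# and duplicates have pushed its quality multiplier down (assumed 0.7).
spammer = final_score(
    true_positives=4,
    submitted=20,
    total_vulns=4,
    structure=1.0,
    calibration=0.8,
    quality_multiplier=0.7,
)
```

Under these assumed numbers the spammer's perfect recall does not make up for the precision loss and the reduced multiplier, so `careful` scores higher than `spammer`.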
- ## Baseline Scores
-
- With deterministic tasks and a simple tool-using model loop, expected baseline tendencies are:
-
- - easy: high recall, moderate precision
- - medium: moderate recall, moderate precision
- - hard: lower recall, stricter penalties for noisy findings
-
- Run inference.py to generate reproducible per-task scores for your selected model setup.
-
- ## Setup
-
- ### Option A: Run in-repo (OpenEnv monorepo)
-
- From repository root:
-
- ```bash
- docker build -t code-security-auditor-env:latest -f envs/code_security_auditor_env/server/Dockerfile .
- docker run -p 8000:8000 code-security-auditor-env:latest
- ```
-
- ### Option B: Run standalone
-
- From this directory:
-
- ```bash
- docker build -t code-security-auditor-env:latest .
- docker run -p 8000:8000 code-security-auditor-env:latest
- ```
-
- ## Baseline Inference
-
- The required script is inference.py in the project root (this directory).
-
- Required env vars:
-
- - API_BASE_URL
- - MODEL_NAME
- - HF_TOKEN
-
- Optional env vars:
-
- - LOCAL_IMAGE_NAME (for from_docker_image mode)
- - ENV_BASE_URL (for connecting to an already-running server)
- - TASK_IDS (comma-separated task ids, default: easy,medium,hard)
- - MAX_STEPS
-
- Run:
-
- ```bash
- export HF_TOKEN=your_token
- export API_BASE_URL=https://router.huggingface.co/v1
- export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
- export LOCAL_IMAGE_NAME=code-security-auditor-env:latest
- python inference.py
- ```
-
- The script prints only [START], [STEP], and [END] log lines per task.
-
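The environment variables listed in the removed section suggest a configuration step like the one sketched below. This is not the contents of `inference.py`: the function name `load_config`, the returned dictionary keys, and the decision to leave `MAX_STEPS` as `None` when unset are assumptions. The variable names and the `TASK_IDS` default come from the README.

```python
import os
from typing import Mapping

REQUIRED = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read the documented env vars, failing fast on missing required ones."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))

    raw_tasks = env.get("TASK_IDS", "easy,medium,hard")
    max_steps = env.get("MAX_STEPS")
    return {
        "api_base_url": env["API_BASE_URL"],
        "model_name": env["MODEL_NAME"],
        "hf_token": env["HF_TOKEN"],
        # Optional: at most one of these two is expected to be set.
        "local_image_name": env.get("LOCAL_IMAGE_NAME"),
        "env_base_url": env.get("ENV_BASE_URL"),
        "task_ids": [t.strip() for t in raw_tasks.split(",") if t.strip()],
        # The README gives no default for MAX_STEPS, so none is assumed here.
        "max_steps": int(max_steps) if max_steps else None,
    }
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise without touching the real environment.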
- ## Hugging Face Spaces Deployment
-
- Space repository:
-
- - https://huggingface.co/spaces/Drac0528/CodeSecure
-
- Recommended deploy flow (git push to Space repo):
-
- ```bash
- git clone https://huggingface.co/spaces/Drac0528/CodeSecure
- cd CodeSecure
- cp -R /path/to/code_security_auditor_env/* .
- rm -f .env
- git add .
- git commit -m "Deploy Code Security Auditor OpenEnv"
- git push
- ```
-
- Notes:
-
- - Keep README frontmatter and Dockerfile at Space repo root.
- - Use Space Settings to set runtime secrets/variables:
-   - HF_TOKEN (Secret)
-   - API_BASE_URL (Variable)
-   - MODEL_NAME (Variable)
- - Ensure Space tags include `openenv`.
-
- Verify API endpoint after build:
-
- ```bash
- curl -X POST https://drac0528-codesecure.hf.space/reset -H 'Content-Type: application/json' -d '{}'
- ```
-
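The same check as the curl command above can be done from Python with only the standard library. The sketch below mirrors the request exactly (POST to `/reset` with an empty JSON object); the helper names `build_reset_request` and `reset` are hypothetical, and the assumption that the server answers with a JSON body is not confirmed by the README.

```python
import json
import urllib.request


def build_reset_request(base_url: str) -> urllib.request.Request:
    """Build the POST /reset request without sending it."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/reset",
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset(base_url: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the reply (assumed to be JSON)."""
    request = build_reset_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `reset("https://drac0528-codesecure.hf.space")` would then perform the live check; keeping request construction separate from sending makes the first half testable offline.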
- ## Validation
-
- Use validate-submission.sh before submitting:
-
- ```bash
- chmod +x validate-submission.sh
- ./validate-submission.sh https://drac0528-codesecure.hf.space .
- ```