Drac0528 committed on
Commit e94c324 · verified · 1 Parent(s): 1d2d30c

Delete README.md

Files changed (1)
  1. README.md +0 -200
README.md DELETED
@@ -1,200 +0,0 @@
- ---
- title: Code Security Auditor Environment
- emoji: "🛡️"
- colorFrom: yellow
- colorTo: red
- sdk: docker
- pinned: false
- app_port: 8000
- base_path: /web
- tags:
- - openenv
- - security
- - code-review
- - reinforcement-learning
- ---
-
- # Code Security Auditor Environment
-
- A real-world OpenEnv benchmark where agents perform security auditing on pull-request-style code snapshots.
-
- The agent inspects files, submits vulnerability findings, and finalizes a report. The environment scores each episode with deterministic graders against ground-truth vulnerabilities, awarding partial credit and applying anti-reward-hacking penalties.
-
- ## Why this is a real-world task
-
- Security reviewers and AppSec engineers routinely audit code for vulnerabilities before deployment. This environment models that workflow with concrete exploit classes:
-
- SQL injection
- command injection
- insecure deserialization
- weak authentication / auth bypass
- SSRF
- path traversal
- hardcoded secrets
-
- ## OpenEnv Compliance
-
- Typed models: CodeSecurityAction, CodeSecurityObservation, CodeSecurityState
- Core API: reset(), step(), state()
- OpenEnv manifest: openenv.yaml
- FastAPI runtime via server.app:app
-
- ## Action Space
-
- Action model: CodeSecurityAction
-
- action_type: inspect_file | submit_finding | submit_final_report
- filename: target file to inspect or report against
- line_start, line_end: suspected vulnerable range
- vuln_type: one of supported vulnerability classes
- severity: low | medium | high | critical
- confidence: [0.0, 1.0]
- evidence, summary: free-form context
-
- ### Action semantics
-
- inspect_file: returns full line-numbered file content.
- submit_finding: grades the finding with deterministic partial credit.
- submit_final_report: ends the episode and returns final score in [0.0, 1.0].
-
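The action fields listed above can be sketched as a plain Python dataclass. This is only an illustration: `ActionSketch` and its `validate` method are hypothetical stand-ins, not the environment's actual `CodeSecurityAction` class, and the choice of which fields each action type requires is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values copied from the README's Action Space section.
ACTION_TYPES = {"inspect_file", "submit_finding", "submit_final_report"}
SEVERITIES = {"low", "medium", "high", "critical"}


@dataclass
class ActionSketch:
    """Hypothetical stand-in for CodeSecurityAction (field names from the README)."""

    action_type: str
    filename: Optional[str] = None
    line_start: Optional[int] = None
    line_end: Optional[int] = None
    vuln_type: Optional[str] = None
    severity: Optional[str] = None
    confidence: Optional[float] = None
    evidence: str = ""
    summary: str = ""

    def validate(self) -> None:
        """Raise ValueError if the action is malformed (rules are assumptions)."""
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        if self.action_type in ("inspect_file", "submit_finding") and not self.filename:
            raise ValueError(f"{self.action_type} needs a filename")
        if self.action_type == "submit_finding":
            if self.severity not in SEVERITIES:
                raise ValueError(f"unknown severity: {self.severity!r}")
            if self.confidence is None or not 0.0 <= self.confidence <= 1.0:
                raise ValueError("confidence must be in [0.0, 1.0]")
            if (
                self.line_start is None
                or self.line_end is None
                or self.line_start > self.line_end
            ):
                raise ValueError("line_start and line_end must form a valid range")
```

Validating on the client side before calling `step()` is one way to avoid spending steps on malformed findings; whether the real server rejects or penalizes such actions is not stated in the README.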
- ## Observation Space
-
- Observation model: CodeSecurityObservation
-
- Key fields:
-
- task_id, task_title, difficulty, objective
- available_files
- focused_file, file_excerpt
- findings_so_far
- steps_remaining
- last_feedback
- score_hint in [0, 1]
- reward, done, metadata
-
- ## Tasks and Difficulty
-
- The environment includes 3 deterministic tasks:
-
- 1. easy: Legacy Flask Patch Review
- 2. medium: Payment Webhook Service
- 3. hard: Enterprise Multi-Tenant API
-
- Each task has:
-
- realistic multi-file code snapshot
- hidden vulnerability ground truth
- deterministic grader with score in [0.0, 1.0]
-
- ## Reward Design
-
- Reward shaping is trajectory-aware and resistant to reward hacking:
-
- inspect_file gives a small positive signal for novel, relevant file exploration
- submit_finding gives partial credit along a ladder (file -> type -> line -> severity -> confidence calibration)
- duplicate/low-quality findings reduce quality_multiplier and final score
- false positives and over-submission reduce precision and final score
- final score combines weighted recall, precision, structural quality, and calibration
-
- This keeps the incentives balanced: spamming findings increases the step count but lowers precision and quality, which prevents easy reward exploitation.
-
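How a weighted combination discourages spamming can be sketched as follows. The README does not give the grader's weights or formula, so the weights, the normalization, and the function names `precision_of` and `final_score` are all assumptions made for illustration.

```python
def precision_of(true_positives: int, submitted: int) -> float:
    """Fraction of submitted findings that match ground truth."""
    return true_positives / submitted if submitted else 0.0


def final_score(
    recall: float,
    precision: float,
    quality: float,
    calibration: float,
    weights: tuple = (0.4, 0.3, 0.2, 0.1),  # hypothetical weights, not the grader's
) -> float:
    """Weighted mean of the four components, clamped to [0.0, 1.0]."""
    components = (recall, precision, quality, calibration)
    if any(not 0.0 <= c <= 1.0 for c in components):
        raise ValueError("all components must be in [0.0, 1.0]")
    total = sum(w * c for w, c in zip(weights, components)) / sum(weights)
    return min(1.0, max(0.0, total))


# Two agents find the same 2 real vulnerabilities out of 2 (recall = 1.0).
# The second one also submits 8 false positives, so its precision drops.
careful = final_score(1.0, precision_of(2, 2), quality=1.0, calibration=1.0)
spammer = final_score(1.0, precision_of(2, 10), quality=1.0, calibration=1.0)
```

Under these assumed weights, `spammer` ends up lower than `careful` even though both have full recall, which is the behavior the bullet list describes. In the real grader the quality multiplier would presumably drop as well.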
- ## Baseline Scores
-
- With deterministic tasks and a simple tool-using model loop, expected baseline tendencies are:
-
- easy: high recall, moderate precision
- medium: moderate recall, moderate precision
- hard: lower recall, stricter penalties for noisy findings
-
- Run inference.py to generate reproducible per-task scores for your selected model setup.
-
- ## Setup
-
- ### Option A: Run in-repo (OpenEnv monorepo)
-
- From repository root:
-
- ```bash
- docker build -t code-security-auditor-env:latest -f envs/code_security_auditor_env/server/Dockerfile .
- docker run -p 8000:8000 code-security-auditor-env:latest
- ```
-
- ### Option B: Run standalone
-
- From this directory:
-
- ```bash
- docker build -t code-security-auditor-env:latest .
- docker run -p 8000:8000 code-security-auditor-env:latest
- ```
-
- ## Baseline Inference
-
- The required script is inference.py in project root (this directory).
-
- Required env vars:
-
- API_BASE_URL
- MODEL_NAME
- HF_TOKEN
-
- Optional env vars:
-
- LOCAL_IMAGE_NAME (for from_docker_image mode)
- ENV_BASE_URL (for connecting to an already-running server)
- TASK_IDS (comma-separated task ids, default: easy,medium,hard)
- MAX_STEPS
-
- Run:
-
- ```bash
- export HF_TOKEN=your_token
- export API_BASE_URL=https://router.huggingface.co/v1
- export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
- export LOCAL_IMAGE_NAME=code-security-auditor-env:latest
- python inference.py
- ```
-
- The script prints only [START], [STEP], and [END] log lines per task.
-
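The required and optional variables above could be read along these lines. This is a sketch, not the contents of `inference.py`: the function `load_config` and the returned key names are hypothetical, and because the README documents no default for `MAX_STEPS`, it is left as `None` when unset.

```python
import os

# Names taken from the README's "Required env vars" list.
REQUIRED = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_config(env=None) -> dict:
    """Collect settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required env vars: " + ", ".join(missing))

    # TASK_IDS is comma-separated; the documented default is easy,medium,hard.
    raw_tasks = env.get("TASK_IDS", "easy,medium,hard")
    task_ids = [t.strip() for t in raw_tasks.split(",") if t.strip()]

    max_steps = env.get("MAX_STEPS")
    return {
        "api_base_url": env["API_BASE_URL"],
        "model_name": env["MODEL_NAME"],
        "hf_token": env["HF_TOKEN"],
        "local_image_name": env.get("LOCAL_IMAGE_NAME"),  # from_docker_image mode
        "env_base_url": env.get("ENV_BASE_URL"),  # already-running server
        "task_ids": task_ids,
        "max_steps": int(max_steps) if max_steps else None,
    }
```

Passing a plain dict instead of relying on `os.environ` makes the parsing easy to check without exporting anything in the shell.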
- ## Hugging Face Spaces Deployment
-
- Space repository:
-
- https://huggingface.co/spaces/Drac0528/CodeSecure
-
- Recommended deploy flow (git push to Space repo):
-
- ```bash
- git clone https://huggingface.co/spaces/Drac0528/CodeSecure
- cd CodeSecure
- cp -R /path/to/code_security_auditor_env/* .
- rm -f .env
- git add .
- git commit -m "Deploy Code Security Auditor OpenEnv"
- git push
- ```
-
- Notes:
-
- Keep README frontmatter and Dockerfile at Space repo root.
- Use Space Settings to set runtime secrets/variables:
- HF_TOKEN (Secret)
- API_BASE_URL (Variable)
- MODEL_NAME (Variable)
- Ensure Space tags include `openenv`.
-
- Verify API endpoint after build:
-
- ```bash
- curl -X POST https://drac0528-codesecure.hf.space/reset -H 'Content-Type: application/json' -d '{}'
- ```
-
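The same check as the curl command can be made from Python with only the standard library. The helper names `build_reset_request` and `reset_env` are hypothetical, and the sketch assumes the `/reset` endpoint replies with a JSON body, which the README does not state.

```python
import json
import urllib.request


def build_reset_request(base_url: str) -> urllib.request.Request:
    """Build the POST /reset request with an empty JSON object as the body."""
    url = base_url.rstrip("/") + "/reset"
    return urllib.request.Request(
        url,
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_env(base_url: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the reply (assumed to be JSON)."""
    request = build_reset_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `reset_env("https://drac0528-codesecure.hf.space")` would issue the same request as the curl command; a non-2xx status raises `urllib.error.HTTPError`, which is a quick signal that the Space has not finished building.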
- ## Validation
-
- Use validate-submission.sh before submitting:
-
- ```bash
- chmod +x validate-submission.sh
- ./validate-submission.sh https://drac0528-codesecure.hf.space .
- ```