kanishcr7 committed on
Commit
fd12515
·
1 Parent(s): 97f7a40

Add GRPO training plots and resolve README conflicts

Files changed (3)
  1. README.md +55 -250
  2. assets/grpo1.png +0 -0
  3. assets/grpo2.png +0 -0
README.md CHANGED
@@ -1,4 +1,4 @@
1
- # PatchHawk
2
 
3
  [![Weights & Biases](https://img.shields.io/badge/Weights%20%26%20Biases-FFBE00?logo=weightsandbiases&logoColor=black)](https://wandb.ai)
4
  [![Hugging Face](https://img.shields.io/badge/Hugging%20Face-FFD21E?logo=huggingface&logoColor=black)](https://huggingface.co)
@@ -6,22 +6,15 @@
6
  [![OpenEnv](https://img.shields.io/badge/OpenEnv-Compliant-2ea44f)](https://openenv.dev)
7
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
8
 
9
- <<<<<<< HEAD
10
- [![Weights & Biases](https://img.shields.io/badge/Weights%20%26%20Biases-FFBE00?logo=weightsandbiases&logoColor=black)](https://wandb.ai)
11
- [![Hugging Face](https://img.shields.io/badge/Hugging%20Face-FFD21E?logo=huggingface&logoColor=black)](https://huggingface.co)
12
- [![Python 3.12](https://img.shields.io/badge/Python-3.12-blue?logo=python&logoColor=white)](https://python.org)
13
- [![OpenEnv](https://img.shields.io/badge/OpenEnv-Compliant-2ea44f)](https://openenv.dev)
14
- [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
15
-
16
- **Built for the OpenEnv Hackathon 2026 by Meta**
17
 
18
- PatchHawk is an autonomous DevSecOps agent powered by Group Relative Policy Optimization (GRPO). It moves beyond static vulnerability detection by validating findings inside isolated Docker sandboxes and generating verified, syntactically correct patches. The system closes the loop between detection, validation, and remediation through a cyber‑physical reinforcement learning feedback cycle.
19
 
20
  ---
21
 
22
- ## 📽️ The Vision: Cyber‑Physical RL Loop
23
 
24
- Traditional security scanners suffer from high false‑positive rates and often report vulnerabilities that cannot be exploited or fixed in practice. PatchHawk addresses this by implementing a reinforcement learning loop where the model's reward is tied directly to the success of its patches inside a real execution environment.
25
 
26
  ```mermaid
27
  graph TD
@@ -33,113 +26,35 @@ graph TD
33
  B -->|Patch| G[Verification Pipeline]
34
  G -->|Syntax Check| H{Success?}
35
  G -->|Unit Tests| I{Pass?}
36
- G -->|Re‑Attack| J{Defeated?}
37
  H & I & J -->|All Pass| K[Positive Reward +3.0]
38
  H | I | J -->|Failure| L[Negative Penalty -1.5]
39
  K --> M[Model Update / Optimization]
40
  L --> M
41
  ```
42
 
43
- The agent learns to produce patches that not only compile but also withstand re‑execution of the original exploit vector.
44
 
45
  ---
46
 
47
  ## ✨ Key Features
48
 
49
- - 🛡️ **Autonomous Detection**: Sophisticated supply‑chain analysis identifying typosquatting, backdoors, data exfiltration, and malicious logic in dependencies.
50
- - 🐳 **Hardened Sandboxing**: High‑fidelity Docker isolation with network‑disabled execution, strict resource caps, and ephemeral file systems to safely detonate suspicious code.
51
- - 🧠 **GRPO‑Driven Learning**: Group Relative Policy Optimization (inspired by DeepSeek‑R1) enables trial‑and‑error mastery and structured reasoning without a separate critic model.
52
- - 🧩 **XML Reasoning Traces**: All agent decisions are accompanied by a machine‑readable `<thought>...</thought>` block, providing full auditability of the decision‑making process.
53
- - 📊 **SOC Dashboard**: Real‑time Streamlit interface for monitoring agent behavior, sandbox telemetry, and reward breakdowns.
54
- - ✅ **OpenEnv Compliance**: Fully integrated with the PyTorch OpenEnv framework, ensuring reproducible and shareable reinforcement learning environments.
55
 
56
  ---
57
 
58
  ## 🛠️ Project Structure
59
- =======
60
- **Submitted to the OpenEnv Hackathon 2026, hosted by Meta.**
61
-
62
- PatchHawk is an autonomous DevSecOps agent trained with Group Relative Policy Optimization (GRPO). It moves beyond static vulnerability detection by validating findings inside isolated Docker sandboxes and generating syntactically correct, re-attack-verified patches. The system closes the loop between detection, validation, and remediation through a reinforcement learning feedback cycle grounded in real execution environments.
63
-
64
- ---
65
-
66
- ## Table of Contents
67
-
68
- - [Architecture Overview](#architecture-overview)
69
- - [Key Capabilities](#key-capabilities)
70
- - [Project Structure](#project-structure)
71
- - [Getting Started](#getting-started)
72
- - [Prerequisites](#prerequisites)
73
- - [Installation](#installation)
74
- - [Environment Setup](#environment-setup)
75
- - [Running the Agent](#running-the-agent)
76
- - [Training](#training)
77
- - [Reward Rubric](#reward-rubric)
78
- - [Dashboard](#dashboard)
79
- - [Roadmap](#roadmap)
80
- - [License](#license)
81
-
82
- ---
83
-
84
- ## Architecture Overview
85
-
86
- Traditional security scanners suffer from high false-positive rates and produce findings that are often unexploitable or unfixable in practice. PatchHawk addresses this through a reinforcement learning loop in which the agent's reward is tied directly to the outcome of its patches inside a live execution environment.
87
-
88
- ```
89
- Source Code / PR
90
- |
91
- v
92
- PatchHawk Agent
93
- / | \
94
- Analyze Test Patch
95
- | |
96
- Docker Verification
97
- Sandbox Pipeline
98
- | |
99
- Behavioral Syntax Check
100
- Telemetry Unit Tests
101
- | Re-Attack
102
- \ /
103
- Reward Signal
104
- |
105
- Model Update
106
- ```
107
 
108
- The agent learns to produce patches that not only compile but also withstand re-execution of the original exploit vector. Every decision is accompanied by a structured `<thought>` block, providing a complete and machine-readable audit trail.
109
-
110
- ---
111
-
112
- ## Key Capabilities
113
-
114
- **Autonomous Detection**
115
- Comprehensive supply-chain analysis targeting typosquatting, backdoors, data exfiltration payloads, and malicious dependency logic.
116
- >>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
117
-
118
- **Hardened Sandboxing**
119
- Docker-based isolation with network-disabled execution, strict resource caps, and ephemeral file systems for safe detonation of suspicious packages.
120
-
121
- **GRPO-Driven Learning**
122
- Group Relative Policy Optimization, drawing from the DeepSeek-R1 methodology, enables structured trial-and-error mastery without requiring a separate critic model.
123
-
124
- **Structured Reasoning Traces**
125
- All agent actions are accompanied by a `<thought>...</thought>` XML block logged for full decision auditability.
126
-
127
- **SOC Dashboard**
128
- Real-time Streamlit interface displaying agent reasoning, sandbox telemetry, and reward breakdowns by action type.
129
-
130
- **OpenEnv Compliance**
131
- Fully integrated with the PyTorch OpenEnv framework, ensuring reproducible and shareable reinforcement learning environments.
132
-
133
- ---
134
-
135
- ## Project Structure
136
-
137
- ```
138
  PatchHawk/
139
- <<<<<<< HEAD
140
  ├── src/envs/patchhawk/ # 📦 OpenEnv Submission Package
141
  │   ├── server/ # FastAPI environment server
142
- │   ├── models.py # Type‑safe contract definitions
143
  │   ├── client.py # Environment interaction client
144
  │   └── inference.py # Main agent execution loop
145
  ├── patchhawk/ # 🧠 Core Logic & Training
@@ -147,140 +62,71 @@ PatchHawk/
147
  │   ├── training/ # GRPO / Unsloth training scripts
148
  │   └── app/ # Streamlit SOC Dashboard
149
  ├── docker/ # 🐳 Container configurations
 
150
  ├── config.yaml # Environment & Agent configuration
151
  ├── openenv.yaml # OpenEnv metadata
152
  ├── .env.example # Environment variable template
153
- =======
154
- ├── src/
155
- │   └── envs/
156
- │       └── patchhawk/
157
- │           ├── server/ # FastAPI environment server
158
- │           ├── models.py # Type-safe contract definitions
159
- │           ├── client.py # Environment interaction client
160
- │           └── inference.py # Agent execution loop
161
- ├── patchhawk/
162
- │   ├── data/ # Scenario generation and datasets
163
- │   ├── training/ # GRPO training scripts
164
- │   └── app/ # Streamlit SOC Dashboard
165
- ├── docker/
166
- │   └── Dockerfile.sandbox # Sandbox container configuration
167
- ├── config.yaml # Environment and agent configuration
168
- ├── openenv.yaml # OpenEnv metadata
169
- ├── .env.example # Environment variable template
170
- >>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
171
  └── README.md
172
  ```
173
 
174
  ---
175
 
176
- ## Getting Started
177
 
178
  ### Prerequisites
179
 
180
- <<<<<<< HEAD
181
- - Python 3.12 or higher
182
- - Docker Engine (running locally, with buildx available)
183
- - NVIDIA GPU (8 GB VRAM or more recommended for training and inference)
184
- - Hugging Face account and token (for model access)
185
- =======
186
  - Python 3.12 or higher
187
- - Docker Engine with buildx support
188
- - NVIDIA GPU with 8 GB VRAM or more (required for training; recommended for inference)
189
  - Hugging Face account and access token
190
- >>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
191
 
192
  ### Installation
193
 
194
- Clone the repository and install dependencies into a virtual environment.
195
-
196
  ```bash
 
197
  git clone https://github.com/ramprasathk07/PatchHawk.git
198
  cd PatchHawk
199
 
200
- <<<<<<< HEAD
201
  # Create and activate a virtual environment
202
  python -m venv .venv
203
  source .venv/bin/activate # On Windows: .venv\Scripts\activate
204
 
205
  # Install core dependencies
206
- =======
207
- python -m venv .venv
208
- source .venv/bin/activate # Windows: .venv\Scripts\activate
209
-
210
- >>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
211
  pip install -e .
212
  ```
213
 
214
  ### Environment Setup
215
 
216
  ```bash
217
- <<<<<<< HEAD
218
  # Copy the environment template and populate your keys
219
  cp .env.example .env
220
- # Edit .env to include HF_TOKEN, OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.
221
 
222
  # Build the validation sandbox Docker image
223
- =======
224
- cp .env.example .env
225
- # Populate .env with HF_TOKEN, OPENAI_API_KEY, WANDB_API_KEY, and any other required keys.
226
-
227
- >>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
228
  docker build -t patchhawk-sandbox:latest -f docker/Dockerfile.sandbox .
229
  ```
230
 
231
  ### Running the Agent
232
 
233
- Start the environment server and the inference loop in separate terminal sessions.
234
-
235
  ```bash
236
- <<<<<<< HEAD
237
- # Start the environment server (in one terminal)
238
- python -m server.app --port 8000
239
-
240
- # Execute the inference loop (in another terminal)
241
- =======
242
  # Terminal 1: environment server
243
  python -m server.app --port 8000
244
 
245
  # Terminal 2: inference loop
246
- >>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
247
  python src/envs/patchhawk/inference.py --env-url http://localhost:8000
248
  ```
249
 
250
  ---
251
 
252
- <<<<<<< HEAD
253
- ## 💎 Reward Rubric
254
-
255
- The agent is guided by a granular reward structure that encourages safe, effective, and verifiable actions.
256
-
257
- | Action ID | Action Name | Base Reward | Success Criteria |
258
- | :--- | :--- | :--- | :--- |
259
- | **0** | `ANALYZE` | `0.0` | Observation step; used solely for data gathering. |
260
- | **1** | `DETONATE` | `+0.1` | Successfully extract telemetry from the Docker sandbox. |
261
- | **2** | `BLOCK_PR` | `+2.0 / -1.0` | Positive reward when correctly blocking a malicious PR; negative penalty for false positives. |
262
- | **3** | `SUBMIT_PATCH` | `+3.0 / -1.5` | The primary goal. Reward requires passing syntax check, unit tests, and a re‑attack validation. |
263
- | **4** | `ESCALATE` | `0.0` | Hands off to a human expert when uncertainty exceeds a configurable threshold. |
264
-
265
- ### Dynamic Scaling Factors
266
- - **Risk Accuracy Bonus**: Up to `+2.0` additional reward for accurately predicting the risk score of a vulnerability.
267
- - **Safety Multiplier**: Repeated syntax check failures apply a decay factor to all future rewards.
268
- =======
269
- ## Training
270
 
271
  PatchHawk uses GRPO with a 4-bit quantised Qwen2.5-Coder-7B-Instruct base model and LoRA adapters. The training script is located at `patchhawk/training/train_grpo.py`.
272
 
273
- **Dependencies**
274
 
275
- ```bash
276
- pip install trl==1.0.0 peft bitsandbytes accelerate transformers datasets wandb
277
- ```
278
-
279
- **Dry run (CPU, no model required)**
280
-
281
- ```bash
282
- python -m patchhawk.training.train_grpo --dry-run
283
- ```
284
 
285
  **GPU training (RTX 3060 12 GB defaults)**
286
 
@@ -294,100 +140,59 @@ python -m patchhawk.training.train_grpo \
294
  --output-dir grpo_lora
295
  ```
296
 
297
- Key training parameters and their recommended values for a 12 GB GPU are documented inline in `train_grpo.py`. To upload the trained adapter to the Hugging Face Hub, set the `HF_REPO` environment variable before running.
298
- >>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
299
-
300
  ---
301
 
302
- ## Reward Rubric
303
-
304
- <<<<<<< HEAD
305
- Launch the **Security Operations Center (SOC)** dashboard to observe the agent's reasoning in real time.
306
- =======
307
- The agent is guided by a granular reward structure that incentivises safe, effective, and verifiable actions.
308
 
309
- | Action ID | Action Name | Base Reward | Success Criteria |
310
- |-----------|----------------|--------------|------------------|
311
- | 0 | ANALYZE | 0.0 | Observation step; used for data gathering only. |
312
- | 1 | DETONATE | +0.1 | Successful telemetry extraction from the Docker sandbox. |
313
- | 2 | BLOCK\_PR | +2.0 / -1.0 | Positive reward for correctly blocking a malicious PR; penalty for false positives. |
314
- | 3 | SUBMIT\_PATCH | +3.0 / -1.5 | Reward requires passing syntax check, unit tests, and re-attack validation. |
315
- | 4 | ESCALATE | 0.0 | Defers to a human expert when uncertainty exceeds a configurable threshold. |
316
 
317
- **Dynamic Scaling Factors**
318
 
319
- - **Risk Accuracy Bonus.** Up to +2.0 additional reward for accurately predicting the risk score of a detected vulnerability.
320
- - **Safety Multiplier.** Repeated syntax check failures apply a cumulative decay factor to all future rewards within a training episode.
 
321
 
322
  ---
323
 
324
- ## Dashboard
325
 
326
- Launch the Security Operations Centre dashboard to monitor the agent in real time.
327
- >>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
328
 
329
  ```bash
330
  streamlit run patchhawk/app/dashboard.py
331
  ```
332
 
333
- <<<<<<< HEAD
334
  The dashboard provides:
335
- - Live XML reasoning logs from the agent.
336
- - Real‑time stdout/stderr streams from the Docker sandbox.
337
- - Detailed audit trail of reward assignments and verification outcomes.
338
 
339
  ---
340
 
341
  ## 🗺️ Roadmap & Future Work
342
 
343
- - [ ] **Multi‑Agent Coordination**: Deploy attacker and defender models for automated red‑teaming exercises.
344
- - [ ] **CVE Ingestion**: Automatically generate training scenarios from the National Vulnerability Database (NVD).
345
- - [ ] **Cross-Language Support**: Expand beyond Python to Go, JavaScript, Rust, and Java.
346
- - [ ] **Kubernetes Native**: Orchestrate sandboxes at scale using Kubernetes instead of local Docker.
347
- - [ ] **Fine‑Tuned Vulnerability Model**: Train a specialized 7B parameter LLM (e.g., VulnLLM‑R) on vulnerability‑fixing commits.
348
- - [ ] **Context‑Aware Analysis**: Integrate Code Property Graph (CPG) slicing for LLM‑based semantic vulnerability detection.
349
- - [ ] **Silent Patch Detection**: Identify security‑relevant commits that were not publicly disclosed.
350
- - [ ] **AI‑Generated Code Audit**: Trace vulnerabilities back to AI coding assistants (e.g., GitHub Copilot, ChatGPT).
351
- - [ ] **Automated PR Remediation**: Generate and submit fix‑containing pull requests for detected vulnerabilities.
352
- - [ ] **Adversarial Training Loop**: Implement a self‑improving LLM‑vs‑LLM red‑team / blue‑team training regimen.
353
- - [ ] **Supply‑Chain Malware Detection**: Extend dependency analysis to identify novel, unpublished attack patterns.
354
- =======
355
- The dashboard exposes the following views:
356
-
357
- - Live structured reasoning logs (`<thought>` traces) from the agent.
358
- - Real-time stdout and stderr streams from the Docker sandbox.
359
- - Detailed audit trail of reward assignments and verification outcomes per episode.
360
-
361
- ---
362
-
363
- ## Roadmap
364
-
365
- The following capabilities are planned for future releases. Contributions and issue reports are welcome.
366
-
367
- - **Multi-Agent Red-Teaming.** Deploy attacker and defender models for automated adversarial exercises.
368
- - **CVE Ingestion.** Automatically generate training scenarios from the National Vulnerability Database.
369
- - **Cross-Language Support.** Extend analysis beyond Python to Go, JavaScript, Rust, and Java.
370
- - **Kubernetes Orchestration.** Scale sandbox execution using Kubernetes instead of local Docker.
371
- - **Fine-Tuned Vulnerability Model.** Train a specialised model on vulnerability-fixing commits.
372
- - **Code Property Graph Integration.** Apply CPG slicing for semantic vulnerability detection.
373
- - **Silent Patch Detection.** Identify security-relevant commits that were not publicly disclosed.
374
- - **AI-Generated Code Audit.** Trace vulnerabilities to AI coding assistants such as GitHub Copilot.
375
- - **Automated PR Remediation.** Generate and submit fix-containing pull requests for detected issues.
376
- - **Adversarial Self-Improvement.** Implement an LLM-vs-LLM red-team / blue-team training regimen.
377
- - **Supply-Chain Malware Detection.** Extend dependency analysis to novel, unpublished attack patterns.
378
- - **Dashboard Enhancements.** Add historical trend analysis, model performance metrics, and alerting.
379
- >>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
380
 
381
  ---
382
 
383
- ## License
384
 
385
- <<<<<<< HEAD
386
  Distributed under the **MIT License**. See the LICENSE file in the repository root for full details.
387
 
388
  Developed with ❤️ by **Ramprasath K & The PatchHawk Team** for the OpenEnv Hackathon 2026 hosted by Meta.
389
- =======
390
- Distributed under the MIT License. See `LICENSE` in the repository root for the full terms.
391
-
392
- Developed by Ramprasath K and the PatchHawk team.
393
- >>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
 
1
+ # 🦅 PatchHawk: Autonomous Supply-Chain Guard
2
 
3
  [![Weights & Biases](https://img.shields.io/badge/Weights%20%26%20Biases-FFBE00?logo=weightsandbiases&logoColor=black)](https://wandb.ai)
4
  [![Hugging Face](https://img.shields.io/badge/Hugging%20Face-FFD21E?logo=huggingface&logoColor=black)](https://huggingface.co)
 
6
  [![OpenEnv](https://img.shields.io/badge/OpenEnv-Compliant-2ea44f)](https://openenv.dev)
7
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
8
 
9
+ **Submitted to the OpenEnv Hackathon 2026, hosted by Meta.**
10
 
11
+ PatchHawk is an autonomous DevSecOps agent powered by Group Relative Policy Optimization (GRPO). It moves beyond static vulnerability detection by validating findings inside isolated Docker sandboxes and generating verified, syntactically correct patches. The system closes the loop between detection, validation, and remediation through a cyber-physical reinforcement learning feedback cycle grounded in real execution environments.
12
 
13
  ---
14
 
15
+ ## 📽️ The Vision: Cyber-Physical RL Loop
16
 
17
+ Traditional security scanners suffer from high false-positive rates and often report vulnerabilities that cannot be exploited or fixed in practice. PatchHawk addresses this by implementing a reinforcement learning loop where the model's reward is tied directly to the success of its patches inside a real execution environment.
18
 
19
  ```mermaid
20
  graph TD
 
26
  B -->|Patch| G[Verification Pipeline]
27
  G -->|Syntax Check| H{Success?}
28
  G -->|Unit Tests| I{Pass?}
29
+ G -->|Re-Attack| J{Defeated?}
30
  H & I & J -->|All Pass| K[Positive Reward +3.0]
31
  H | I | J -->|Failure| L[Negative Penalty -1.5]
32
  K --> M[Model Update / Optimization]
33
  L --> M
34
  ```
35
 
36
+ The agent learns to produce patches that not only compile but also withstand re-execution of the original exploit vector. Every decision is accompanied by a structured `<thought>` block, providing a complete and machine-readable audit trail.
37
 
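The three gates in the diagram can be sketched in Python as follows. This is a minimal illustration, not the project's actual pipeline: `syntax_ok` and `verify_patch` are hypothetical names, the unit-test and re-attack steps are passed in as caller-supplied callables, and only the `+3.0` / `-1.5` values come from the diagram.

```python
import ast
from typing import Callable

PATCH_REWARD = 3.0    # "All Pass" branch in the diagram
PATCH_PENALTY = -1.5  # "Failure" branch in the diagram


def syntax_ok(patched_source: str) -> bool:
    """Return True if the patched Python source parses."""
    try:
        ast.parse(patched_source)
    except SyntaxError:
        return False
    return True


def verify_patch(
    patched_source: str,
    run_unit_tests: Callable[[str], bool],
    exploit_defeated: Callable[[str], bool],
) -> float:
    """Apply the three gates in order and map the outcome to a reward.

    Both callables are hypothetical hooks; a real pipeline would
    presumably run them inside the sandbox.
    """
    if not syntax_ok(patched_source):
        return PATCH_PENALTY
    if not run_unit_tests(patched_source):
        return PATCH_PENALTY
    if not exploit_defeated(patched_source):
        return PATCH_PENALTY
    return PATCH_REWARD
```

Checking syntax first keeps the cheapest gate in front, so a patch that does not parse never reaches the more expensive sandbox steps.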
38
  ---
39
 
40
  ## ✨ Key Features
41
 
42
+ - 🛡️ **Autonomous Detection**: Sophisticated supply-chain analysis identifying typosquatting, backdoors, data exfiltration, and malicious logic in dependencies.
43
+ - 🐳 **Hardened Sandboxing**: High-fidelity Docker isolation with network-disabled execution, strict resource caps, and ephemeral file systems to safely detonate suspicious code.
44
+ - 🧠 **GRPO-Driven Learning**: Group Relative Policy Optimization (inspired by DeepSeek-R1) enables trial-and-error mastery and structured reasoning without a separate critic model.
45
+ - 🧩 **XML Reasoning Traces**: All agent decisions are accompanied by a machine-readable `<thought>...</thought>` block, providing full auditability of the decision-making process.
46
+ - 📊 **SOC Dashboard**: Real-time Streamlit interface for monitoring agent behavior, sandbox telemetry, and reward breakdowns.
47
+ - ✅ **OpenEnv Compliance**: Fully integrated with the PyTorch OpenEnv framework, ensuring reproducible and shareable reinforcement learning environments.
48
 
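To make the "without a separate critic model" point concrete, here is a small sketch of the group-relative advantage that GRPO is generally described as using: each sampled completion is scored against the mean and standard deviation of its own group rather than against a learned value estimate. The function name, the `eps` value, and the choice of population standard deviation are assumptions for illustration; the training script may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled completions scored with the patch rewards
# from the diagram above (+3.0 on success, -1.5 on failure).
advantages = group_relative_advantages([3.0, -1.5, -1.5, 3.0])
```

Completions that beat their group average get a positive advantage and the rest get a negative one. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.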
49
  ---
50
 
51
  ## 🛠️ Project Structure
52
 
53
+ ```text
54
  PatchHawk/
 
55
  ├── src/envs/patchhawk/ # 📦 OpenEnv Submission Package
56
  │   ├── server/ # FastAPI environment server
57
+ │   ├── models.py # Type-safe contract definitions
58
  │   ├── client.py # Environment interaction client
59
  │   └── inference.py # Main agent execution loop
60
  ├── patchhawk/ # 🧠 Core Logic & Training
 
62
  │   ├── training/ # GRPO / Unsloth training scripts
63
  │   └── app/ # Streamlit SOC Dashboard
64
  ├── docker/ # 🐳 Container configurations
65
+ ├── assets/ # 🖼️ Training plots & Media
66
  ├── config.yaml # Environment & Agent configuration
67
  ├── openenv.yaml # OpenEnv metadata
68
  ├── .env.example # Environment variable template
69
  └── README.md
70
  ```
71
 
72
  ---
73
 
74
+ ## 🚀 Getting Started
75
 
76
  ### Prerequisites
77
 
78
  - Python 3.12 or higher
79
+ - Docker Engine (with buildx support)
80
+ - NVIDIA GPU (8 GB VRAM or more recommended for training and inference)
81
  - Hugging Face account and access token
 
82
 
83
  ### Installation
84
 
 
 
85
  ```bash
86
+ # Clone the repository
87
  git clone https://github.com/ramprasathk07/PatchHawk.git
88
  cd PatchHawk
89
 
 
90
  # Create and activate a virtual environment
91
  python -m venv .venv
92
  source .venv/bin/activate # On Windows: .venv\Scripts\activate
93
 
94
  # Install core dependencies
95
  pip install -e .
96
  ```
97
 
98
  ### Environment Setup
99
 
100
  ```bash
 
101
  # Copy the environment template and populate your keys
102
  cp .env.example .env
103
+ # Edit .env to include HF_TOKEN, OPENAI_API_KEY, WANDB_API_KEY, etc.
104
 
105
  # Build the validation sandbox Docker image
106
  docker build -t patchhawk-sandbox:latest -f docker/Dockerfile.sandbox .
107
  ```
108
 
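Once the image is built, a network-disabled, resource-capped run could be assembled along these lines. This is a sketch under stated assumptions: the `docker run` flags are standard Docker CLI options, but the specific limits, the `sandbox_command` helper, and the `/sandbox/suspect.py` path are hypothetical, and how the suspect file gets into the container is left out.

```python
import shlex


def sandbox_command(image: str, script_path: str) -> list[str]:
    """Build a `docker run` argv for a locked-down, throwaway container."""
    return [
        "docker", "run",
        "--rm",                 # remove the container on exit
        "--network", "none",    # no network access
        "--memory", "512m",     # assumed memory cap
        "--cpus", "1",          # assumed CPU cap
        "--pids-limit", "128",  # assumed process cap
        "--read-only",          # root filesystem is read-only
        "--tmpfs", "/tmp",      # scratch space that vanishes with the container
        image,
        "python", script_path,  # assumes the image ships a Python interpreter
    ]


cmd = sandbox_command("patchhawk-sandbox:latest", "/sandbox/suspect.py")
print(shlex.join(cmd))
```

Building the command as a list rather than a shell string means it can be handed to `subprocess.run(cmd, capture_output=True, text=True, timeout=...)` without shell quoting concerns.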
109
  ### Running the Agent
110
 
 
 
111
  ```bash
112
  # Terminal 1: environment server
113
  python -m server.app --port 8000
114
 
115
  # Terminal 2: inference loop
 
116
  python src/envs/patchhawk/inference.py --env-url http://localhost:8000
117
  ```
118
 
119
  ---
120
 
121
+ ## 🧠 Training
122
 
123
  PatchHawk uses GRPO with a 4-bit quantised Qwen2.5-Coder-7B-Instruct base model and LoRA adapters. The training script is located at `patchhawk/training/train_grpo.py`.
124
 
125
+ ### Training Progress
126
 
127
+ | GRPO Training Reward | GRPO Group Reward Variance |
128
+ | :--- | :--- |
129
+ | ![GRPO Training Reward](assets/grpo1.png) | ![GRPO Group Reward Variance](assets/grpo2.png) |
130
 
131
  **GPU training (RTX 3060 12 GB defaults)**
132
 
 
140
  --output-dir grpo_lora
141
  ```
142
 
 
 
 
143
  ---
144
 
145
+ ## 💎 Reward Rubric
146
 
147
+ The agent is guided by a granular reward structure that encourages safe, effective, and verifiable actions.
148
 
149
+ | Action ID | Action Name | Base Reward | Success Criteria |
150
+ | :--- | :--- | :--- | :--- |
151
+ | **0** | `ANALYZE` | `0.0` | Observation step; used solely for data gathering. |
152
+ | **1** | `DETONATE` | `+0.1` | Successfully extract telemetry from the Docker sandbox. |
153
+ | **2** | `BLOCK_PR` | `+2.0 / -1.0` | Positive reward when correctly blocking a malicious PR; negative penalty for false positives. |
154
+ | **3** | `SUBMIT_PATCH` | `+3.0 / -1.5` | The primary goal. Reward requires passing syntax check, unit tests, and a re-attack validation. |
155
+ | **4** | `ESCALATE` | `0.0` | Hands off to a human expert when uncertainty exceeds a configurable threshold. |
156
 
157
+ ### Dynamic Scaling Factors
158
+ - **Risk Accuracy Bonus**: Up to `+2.0` additional reward for accurately predicting the risk score of a vulnerability.
159
+ - **Safety Multiplier**: Repeated syntax check failures apply a decay factor to all future rewards.
160
 
161
  ---
162
 
163
+ ## 📊 Dashboard
164
 
165
+ Launch the **Security Operations Center (SOC)** dashboard to observe the agent's reasoning in real time.
 
166
 
167
  ```bash
168
  streamlit run patchhawk/app/dashboard.py
169
  ```
170
 
 
171
  The dashboard provides:
172
+ - Live XML reasoning logs (`<thought>` traces) from the agent.
173
+ - Real-time stdout/stderr streams from the Docker sandbox.
174
+ - Detailed audit trail of reward assignments and verification outcomes.
175
 
176
  ---
177
 
178
  ## 🗺️ Roadmap & Future Work
179
 
180
+ - [ ] **Multi-Agent Coordination**: Deploy attacker and defender models for automated red-teaming exercises.
181
+ - [ ] **CVE Ingestion**: Automatically generate training scenarios from the National Vulnerability Database (NVD).
182
+ - [ ] **Cross-Language Support**: Expand beyond Python to Go, JavaScript, Rust, and Java.
183
+ - [ ] **Kubernetes Native**: Orchestrate sandboxes at scale using Kubernetes instead of local Docker.
184
+ - [ ] **Fine-Tuned Vulnerability Model**: Train a specialized 7B parameter LLM (e.g., VulnLLM-R) on vulnerability-fixing commits.
185
+ - [ ] **Context-Aware Analysis**: Integrate Code Property Graph (CPG) slicing for LLM-based semantic vulnerability detection.
186
+ - [ ] **Silent Patch Detection**: Identify security-relevant commits that were not publicly disclosed.
187
+ - [ ] **AI-Generated Code Audit**: Trace vulnerabilities back to AI coding assistants (e.g., GitHub Copilot, ChatGPT).
188
+ - [ ] **Automated PR Remediation**: Generate and submit fix-containing pull requests for detected vulnerabilities.
189
+ - [ ] **Adversarial Training Loop**: Implement a self-improving LLM-vs-LLM red-team / blue-team training regimen.
190
+ - [ ] **Supply-Chain Malware Detection**: Extend dependency analysis to identify novel, unpublished attack patterns.
191
 
192
  ---
193
 
194
+ ## 📝 License
195
 
 
196
  Distributed under the **MIT License**. See the LICENSE file in the repository root for full details.
197
 
198
  Developed with ❤️ by **Ramprasath K & The PatchHawk Team** for the OpenEnv Hackathon 2026 hosted by Meta.
assets/grpo1.png ADDED
assets/grpo2.png ADDED