Add GRPO training plots and resolve README conflicts
- README.md +55 -250
- assets/grpo1.png +0 -0
- assets/grpo2.png +0 -0
README.md
CHANGED
|
@@ -1,4 +1,4 @@
|
|
| 1 |
-
# PatchHawk
|
| 2 |
|
| 3 |
[](https://wandb.ai)
|
| 4 |
[](https://huggingface.co)
|
|
@@ -6,22 +6,15 @@
|
|
| 6 |
[](https://openenv.dev)
|
| 7 |
[](https://opensource.org/licenses/MIT)
|
| 8 |
|
| 9 |
-
|
| 10 |
-
[](https://wandb.ai)
|
| 11 |
-
[](https://huggingface.co)
|
| 12 |
-
[](https://python.org)
|
| 13 |
-
[](https://openenv.dev)
|
| 14 |
-
[](https://opensource.org/licenses/MIT)
|
| 15 |
-
|
| 16 |
-
**Built for the OpenEnv Hackathon 2026 by Meta**
|
| 17 |
|
| 18 |
-
PatchHawk is an autonomous DevSecOps agent powered by Group Relative Policy Optimization (GRPO). It moves beyond static vulnerability detection by validating findings inside isolated Docker sandboxes and generating verified, syntactically correct patches. The system closes the loop between detection, validation, and remediation through a cyber
|
| 19 |
|
| 20 |
---
|
| 21 |
|
| 22 |
-
## 📽️ The Vision: Cyber
|
| 23 |
|
| 24 |
-
Traditional security scanners suffer from high false
|
| 25 |
|
| 26 |
```mermaid
|
| 27 |
graph TD
|
|
@@ -33,113 +26,35 @@ graph TD
|
|
| 33 |
B -->|Patch| G[Verification Pipeline]
|
| 34 |
G -->|Syntax Check| H{Success?}
|
| 35 |
G -->|Unit Tests| I{Pass?}
|
| 36 |
-
G -->|Re
|
| 37 |
H & I & J -->|All Pass| K[Positive Reward +3.0]
|
| 38 |
H | I | J -->|Failure| L[Negative Penalty -1.5]
|
| 39 |
K --> M[Model Update / Optimization]
|
| 40 |
L --> M
|
| 41 |
```
|
| 42 |
|
| 43 |
-
The agent learns to produce patches that not only compile but also withstand re
|
| 44 |
|
| 45 |
---
|
| 46 |
|
| 47 |
## ✨ Key Features
|
| 48 |
|
| 49 |
-
-
|
| 50 |
-
-
|
| 51 |
-
-
|
| 52 |
-
-
|
| 53 |
-
-
|
| 54 |
-
-
|
| 55 |
|
| 56 |
---
|
| 57 |
|
| 58 |
## 🛠️ Project Structure
|
| 59 |
-
=======
|
| 60 |
-
**Submitted to the OpenEnv Hackathon 2026, hosted by Meta.**
|
| 61 |
-
|
| 62 |
-
PatchHawk is an autonomous DevSecOps agent trained with Group Relative Policy Optimization (GRPO). It moves beyond static vulnerability detection by validating findings inside isolated Docker sandboxes and generating syntactically correct, re-attack-verified patches. The system closes the loop between detection, validation, and remediation through a reinforcement learning feedback cycle grounded in real execution environments.
|
| 63 |
-
|
| 64 |
-
---
|
| 65 |
-
|
| 66 |
-
## Table of Contents
|
| 67 |
-
|
| 68 |
-
- [Architecture Overview](#architecture-overview)
|
| 69 |
-
- [Key Capabilities](#key-capabilities)
|
| 70 |
-
- [Project Structure](#project-structure)
|
| 71 |
-
- [Getting Started](#getting-started)
|
| 72 |
-
- [Prerequisites](#prerequisites)
|
| 73 |
-
- [Installation](#installation)
|
| 74 |
-
- [Environment Setup](#environment-setup)
|
| 75 |
-
- [Running the Agent](#running-the-agent)
|
| 76 |
-
- [Training](#training)
|
| 77 |
-
- [Reward Rubric](#reward-rubric)
|
| 78 |
-
- [Dashboard](#dashboard)
|
| 79 |
-
- [Roadmap](#roadmap)
|
| 80 |
-
- [License](#license)
|
| 81 |
-
|
| 82 |
-
---
|
| 83 |
-
|
| 84 |
-
## Architecture Overview
|
| 85 |
-
|
| 86 |
-
Traditional security scanners suffer from high false-positive rates and produce findings that are often unexploitable or unfixable in practice. PatchHawk addresses this through a reinforcement learning loop in which the agent's reward is tied directly to the outcome of its patches inside a live execution environment.
|
| 87 |
-
|
| 88 |
-
```
|
| 89 |
-
Source Code / PR
|
| 90 |
-
|
|
| 91 |
-
v
|
| 92 |
-
PatchHawk Agent
|
| 93 |
-
/ | \
|
| 94 |
-
Analyze Test Patch
|
| 95 |
-
| |
|
| 96 |
-
Docker Verification
|
| 97 |
-
Sandbox Pipeline
|
| 98 |
-
| |
|
| 99 |
-
Behavioral Syntax Check
|
| 100 |
-
Telemetry Unit Tests
|
| 101 |
-
| Re-Attack
|
| 102 |
-
\ /
|
| 103 |
-
Reward Signal
|
| 104 |
-
|
|
| 105 |
-
Model Update
|
| 106 |
-
```
|
| 107 |
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
---
|
| 111 |
-
|
| 112 |
-
## Key Capabilities
|
| 113 |
-
|
| 114 |
-
**Autonomous Detection**
|
| 115 |
-
Comprehensive supply-chain analysis targeting typosquatting, backdoors, data exfiltration payloads, and malicious dependency logic.
|
| 116 |
-
>>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
|
| 117 |
-
|
| 118 |
-
**Hardened Sandboxing**
|
| 119 |
-
Docker-based isolation with network-disabled execution, strict resource caps, and ephemeral file systems for safe detonation of suspicious packages.
|
| 120 |
-
|
| 121 |
-
**GRPO-Driven Learning**
|
| 122 |
-
Group Relative Policy Optimization, drawing from the DeepSeek-R1 methodology, enables structured trial-and-error mastery without requiring a separate critic model.
|
| 123 |
-
|
| 124 |
-
**Structured Reasoning Traces**
|
| 125 |
-
All agent actions are accompanied by a `<thought>...</thought>` XML block logged for full decision auditability.
|
| 126 |
-
|
| 127 |
-
**SOC Dashboard**
|
| 128 |
-
Real-time Streamlit interface displaying agent reasoning, sandbox telemetry, and reward breakdowns by action type.
|
| 129 |
-
|
| 130 |
-
**OpenEnv Compliance**
|
| 131 |
-
Fully integrated with the PyTorch OpenEnv framework, ensuring reproducible and shareable reinforcement learning environments.
|
| 132 |
-
|
| 133 |
-
---
|
| 134 |
-
|
| 135 |
-
## Project Structure
|
| 136 |
-
|
| 137 |
-
```
|
| 138 |
PatchHawk/
|
| 139 |
-
<<<<<<< HEAD
|
| 140 |
├── src/envs/patchhawk/ # 📦 OpenEnv Submission Package
|
| 141 |
│   ├── server/ # FastAPI environment server
|
| 142 |
-
│   ├── models.py # Type
|
| 143 |
│   ├── client.py # Environment interaction client
|
| 144 |
│   └── inference.py # Main agent execution loop
|
| 145 |
├── patchhawk/ # 🧠 Core Logic & Training
|
|
@@ -147,140 +62,71 @@ PatchHawk/
|
|
| 147 |
│   ├── training/ # GRPO / Unsloth training scripts
|
| 148 |
│   └── app/ # Streamlit SOC Dashboard
|
| 149 |
├── docker/ # 🐳 Container configurations
|
|
|
|
| 150 |
├── config.yaml # Environment & Agent configuration
|
| 151 |
├── openenv.yaml # OpenEnv metadata
|
| 152 |
├── .env.example # Environment variable template
|
| 153 |
-
=======
|
| 154 |
-
├── src/
|
| 155 |
-
│   └── envs/
|
| 156 |
-
│       └── patchhawk/
|
| 157 |
-
│           ├── server/ # FastAPI environment server
|
| 158 |
-
│           ├── models.py # Type-safe contract definitions
|
| 159 |
-
│           ├── client.py # Environment interaction client
|
| 160 |
-
│           └── inference.py # Agent execution loop
|
| 161 |
-
├── patchhawk/
|
| 162 |
-
│   ├── data/ # Scenario generation and datasets
|
| 163 |
-
│   ├── training/ # GRPO training scripts
|
| 164 |
-
│   └── app/ # Streamlit SOC Dashboard
|
| 165 |
-
├── docker/
|
| 166 |
-
│   └── Dockerfile.sandbox # Sandbox container configuration
|
| 167 |
-
├── config.yaml # Environment and agent configuration
|
| 168 |
-
├── openenv.yaml # OpenEnv metadata
|
| 169 |
-
├── .env.example # Environment variable template
|
| 170 |
-
>>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
|
| 171 |
└── README.md
|
| 172 |
```
|
| 173 |
|
| 174 |
---
|
| 175 |
|
| 176 |
-
## Getting Started
|
| 177 |
|
| 178 |
### Prerequisites
|
| 179 |
|
| 180 |
-
<<<<<<< HEAD
|
| 181 |
-
- Python 3.12 or higher
|
| 182 |
-
- Docker Engine (running locally, with buildx available)
|
| 183 |
-
- NVIDIA GPU (8 GB VRAM or more recommended for training and inference)
|
| 184 |
-
- Hugging Face account and token (for model access)
|
| 185 |
-
=======
|
| 186 |
- Python 3.12 or higher
|
| 187 |
-
- Docker Engine with buildx support
|
| 188 |
-
- NVIDIA GPU
|
| 189 |
- Hugging Face account and access token
|
| 190 |
-
>>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
|
| 191 |
|
| 192 |
### Installation
|
| 193 |
|
| 194 |
-
Clone the repository and install dependencies into a virtual environment.
|
| 195 |
-
|
| 196 |
```bash
|
|
|
|
| 197 |
git clone https://github.com/ramprasathk07/PatchHawk.git
|
| 198 |
cd PatchHawk
|
| 199 |
|
| 200 |
-
<<<<<<< HEAD
|
| 201 |
# Create and activate a virtual environment
|
| 202 |
python -m venv .venv
|
| 203 |
source .venv/bin/activate # On Windows: .venv\Scripts\activate
|
| 204 |
|
| 205 |
# Install core dependencies
|
| 206 |
-
=======
|
| 207 |
-
python -m venv .venv
|
| 208 |
-
source .venv/bin/activate # Windows: .venv\Scripts\activate
|
| 209 |
-
|
| 210 |
-
>>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
|
| 211 |
pip install -e .
|
| 212 |
```
|
| 213 |
|
| 214 |
### Environment Setup
|
| 215 |
|
| 216 |
```bash
|
| 217 |
-
<<<<<<< HEAD
|
| 218 |
# Copy the environment template and populate your keys
|
| 219 |
cp .env.example .env
|
| 220 |
-
# Edit .env to include HF_TOKEN, OPENAI_API_KEY,
|
| 221 |
|
| 222 |
# Build the validation sandbox Docker image
|
| 223 |
-
=======
|
| 224 |
-
cp .env.example .env
|
| 225 |
-
# Populate .env with HF_TOKEN, OPENAI_API_KEY, WANDB_API_KEY, and any other required keys.
|
| 226 |
-
|
| 227 |
-
>>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
|
| 228 |
docker build -t patchhawk-sandbox:latest -f docker/Dockerfile.sandbox .
|
| 229 |
```
|
| 230 |
|
| 231 |
### Running the Agent
|
| 232 |
|
| 233 |
-
Start the environment server and the inference loop in separate terminal sessions.
|
| 234 |
-
|
| 235 |
```bash
|
| 236 |
-
<<<<<<< HEAD
|
| 237 |
-
# Start the environment server (in one terminal)
|
| 238 |
-
python -m server.app --port 8000
|
| 239 |
-
|
| 240 |
-
# Execute the inference loop (in another terminal)
|
| 241 |
-
=======
|
| 242 |
# Terminal 1: environment server
|
| 243 |
python -m server.app --port 8000
|
| 244 |
|
| 245 |
# Terminal 2: inference loop
|
| 246 |
-
>>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
|
| 247 |
python src/envs/patchhawk/inference.py --env-url http://localhost:8000
|
| 248 |
```
|
| 249 |
|
| 250 |
---
|
| 251 |
|
| 252 |
-
|
| 253 |
-
## Reward Rubric
|
| 254 |
-
|
| 255 |
-
The agent is guided by a granular reward structure that encourages safe, effective, and verifiable actions.
|
| 256 |
-
|
| 257 |
-
| Action ID | Action Name | Base Reward | Success Criteria |
|
| 258 |
-
| :--- | :--- | :--- | :--- |
|
| 259 |
-
| **0** | `ANALYZE` | `0.0` | Observation step; used solely for data gathering. |
|
| 260 |
-
| **1** | `DETONATE` | `+0.1` | Successfully extract telemetry from the Docker sandbox. |
|
| 261 |
-
| **2** | `BLOCK_PR` | `+2.0 / -1.0` | Positive reward when correctly blocking a malicious PR; negative penalty for false positives. |
|
| 262 |
-
| **3** | `SUBMIT_PATCH` | `+3.0 / -1.5` | The primary goal. Reward requires passing syntax check, unit tests, and a re-attack validation. |
|
| 263 |
-
| **4** | `ESCALATE` | `0.0` | Hands off to a human expert when uncertainty exceeds a configurable threshold. |
|
| 264 |
-
|
| 265 |
-
### Dynamic Scaling Factors
|
| 266 |
-
- **Risk Accuracy Bonus**: Up to `+2.0` additional reward for accurately predicting the risk score of a vulnerability.
|
| 267 |
-
- **Safety Multiplier**: Repeated syntax check failures apply a decay factor to all future rewards.
|
| 268 |
-
=======
|
| 269 |
-
## Training
|
| 270 |
|
| 271 |
PatchHawk uses GRPO with a 4-bit quantised Qwen2.5-Coder-7B-Instruct base model and LoRA adapters. The training script is located at `patchhawk/training/train_grpo.py`.
|
| 272 |
|
| 273 |
-
|
| 274 |
|
| 275 |
-
|
| 276 |
-
|
| 277 |
-
|
| 278 |
-
|
| 279 |
-
**Dry run (CPU, no model required)**
|
| 280 |
-
|
| 281 |
-
```bash
|
| 282 |
-
python -m patchhawk.training.train_grpo --dry-run
|
| 283 |
-
```
|
| 284 |
|
| 285 |
**GPU training (RTX 3060 12 GB defaults)**
|
| 286 |
|
|
@@ -294,100 +140,59 @@ python -m patchhawk.training.train_grpo \
|
|
| 294 |
--output-dir grpo_lora
|
| 295 |
```
|
| 296 |
|
| 297 |
-
Key training parameters and their recommended values for a 12 GB GPU are documented inline in `train_grpo.py`. To upload the trained adapter to the Hugging Face Hub, set the `HF_REPO` environment variable before running.
|
| 298 |
-
>>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
|
| 299 |
-
|
| 300 |
---
|
| 301 |
|
| 302 |
-
## Reward Rubric
|
| 303 |
-
|
| 304 |
-
<<<<<<< HEAD
|
| 305 |
-
Launch the **Security Operations Center (SOC)** dashboard to observe the agent's reasoning in real time.
|
| 306 |
-
=======
|
| 307 |
-
The agent is guided by a granular reward structure that incentivises safe, effective, and verifiable actions.
|
| 308 |
|
| 309 |
-
|
| 310 |
-
|-----------|----------------|--------------|------------------|
|
| 311 |
-
| 0 | ANALYZE | 0.0 | Observation step; used for data gathering only. |
|
| 312 |
-
| 1 | DETONATE | +0.1 | Successful telemetry extraction from the Docker sandbox. |
|
| 313 |
-
| 2 | BLOCK\_PR | +2.0 / -1.0 | Positive reward for correctly blocking a malicious PR; penalty for false positives. |
|
| 314 |
-
| 3 | SUBMIT\_PATCH | +3.0 / -1.5 | Reward requires passing syntax check, unit tests, and re-attack validation. |
|
| 315 |
-
| 4 | ESCALATE | 0.0 | Defers to a human expert when uncertainty exceeds a configurable threshold. |
|
| 316 |
|
| 317 |
-
|
| 318 |
|
| 319 |
-
|
| 320 |
-
- **
|
|
|
|
| 321 |
|
| 322 |
---
|
| 323 |
|
| 324 |
-
## Dashboard
|
| 325 |
|
| 326 |
-
Launch the Security Operations
|
| 327 |
-
>>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
|
| 328 |
|
| 329 |
```bash
|
| 330 |
streamlit run patchhawk/app/dashboard.py
|
| 331 |
```
|
| 332 |
|
| 333 |
-
<<<<<<< HEAD
|
| 334 |
The dashboard provides:
|
| 335 |
-
-
|
| 336 |
-
-
|
| 337 |
-
-
|
| 338 |
|
| 339 |
---
|
| 340 |
|
| 341 |
## 🗺️ Roadmap & Future Work
|
| 342 |
|
| 343 |
-
-
|
| 344 |
-
-
|
| 345 |
-
-
|
| 346 |
-
-
|
| 347 |
-
-
|
| 348 |
-
-
|
| 349 |
-
-
|
| 350 |
-
-
|
| 351 |
-
-
|
| 352 |
-
-
|
| 353 |
-
-
|
| 354 |
-
=======
|
| 355 |
-
The dashboard exposes the following views:
|
| 356 |
-
|
| 357 |
-
- Live structured reasoning logs (`<thought>` traces) from the agent.
|
| 358 |
-
- Real-time stdout and stderr streams from the Docker sandbox.
|
| 359 |
-
- Detailed audit trail of reward assignments and verification outcomes per episode.
|
| 360 |
-
|
| 361 |
-
---
|
| 362 |
-
|
| 363 |
-
## Roadmap
|
| 364 |
-
|
| 365 |
-
The following capabilities are planned for future releases. Contributions and issue reports are welcome.
|
| 366 |
-
|
| 367 |
-
- **Multi-Agent Red-Teaming.** Deploy attacker and defender models for automated adversarial exercises.
|
| 368 |
-
- **CVE Ingestion.** Automatically generate training scenarios from the National Vulnerability Database.
|
| 369 |
-
- **Cross-Language Support.** Extend analysis beyond Python to Go, JavaScript, Rust, and Java.
|
| 370 |
-
- **Kubernetes Orchestration.** Scale sandbox execution using Kubernetes instead of local Docker.
|
| 371 |
-
- **Fine-Tuned Vulnerability Model.** Train a specialised model on vulnerability-fixing commits.
|
| 372 |
-
- **Code Property Graph Integration.** Apply CPG slicing for semantic vulnerability detection.
|
| 373 |
-
- **Silent Patch Detection.** Identify security-relevant commits that were not publicly disclosed.
|
| 374 |
-
- **AI-Generated Code Audit.** Trace vulnerabilities to AI coding assistants such as GitHub Copilot.
|
| 375 |
-
- **Automated PR Remediation.** Generate and submit fix-containing pull requests for detected issues.
|
| 376 |
-
- **Adversarial Self-Improvement.** Implement an LLM-vs-LLM red-team / blue-team training regimen.
|
| 377 |
-
- **Supply-Chain Malware Detection.** Extend dependency analysis to novel, unpublished attack patterns.
|
| 378 |
-
- **Dashboard Enhancements.** Add historical trend analysis, model performance metrics, and alerting.
|
| 379 |
-
>>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
|
| 380 |
|
| 381 |
---
|
| 382 |
|
| 383 |
-
## License
|
| 384 |
|
| 385 |
-
<<<<<<< HEAD
|
| 386 |
Distributed under the **MIT License**. See the LICENSE file in the repository root for full details.
|
| 387 |
|
| 388 |
Developed with ❤️ by **Ramprasath K & The PatchHawk Team** for the OpenEnv Hackathon 2026 hosted by Meta.
|
| 389 |
-
=======
|
| 390 |
-
Distributed under the MIT License. See `LICENSE` in the repository root for the full terms.
|
| 391 |
-
|
| 392 |
-
Developed by Ramprasath K and the PatchHawk team.
|
| 393 |
-
>>>>>>> 05e09d6e3aa6dfea454f54a20062bd90863a8b86
|
|
|
|
| 1 |
+
# 🦅 PatchHawk: Autonomous Supply-Chain Guard
|
| 2 |
|
| 3 |
[](https://wandb.ai)
|
| 4 |
[](https://huggingface.co)
|
|
|
|
| 6 |
[](https://openenv.dev)
|
| 7 |
[](https://opensource.org/licenses/MIT)
|
| 8 |
|
| 9 |
+
**Submitted to the OpenEnv Hackathon 2026, hosted by Meta.**
|
| 10 |
|
| 11 |
+
PatchHawk is an autonomous DevSecOps agent powered by Group Relative Policy Optimization (GRPO). It moves beyond static vulnerability detection by validating findings inside isolated Docker sandboxes and generating verified, syntactically correct patches. The system closes the loop between detection, validation, and remediation through a cyber-physical reinforcement learning feedback cycle grounded in real execution environments.
|
| 12 |
|
| 13 |
---
|
| 14 |
|
| 15 |
+
## 📽️ The Vision: Cyber-Physical RL Loop
|
| 16 |
|
| 17 |
+
Traditional security scanners suffer from high false-positive rates and often report vulnerabilities that cannot be exploited or fixed in practice. PatchHawk addresses this by implementing a reinforcement learning loop where the model's reward is tied directly to the success of its patches inside a real execution environment.
|
| 18 |
|
| 19 |
```mermaid
|
| 20 |
graph TD
|
|
|
|
| 26 |
B -->|Patch| G[Verification Pipeline]
|
| 27 |
G -->|Syntax Check| H{Success?}
|
| 28 |
G -->|Unit Tests| I{Pass?}
|
| 29 |
+
G -->|Re-Attack| J{Defeated?}
|
| 30 |
H & I & J -->|All Pass| K[Positive Reward +3.0]
|
| 31 |
H | I | J -->|Failure| L[Negative Penalty -1.5]
|
| 32 |
K --> M[Model Update / Optimization]
|
| 33 |
L --> M
|
| 34 |
```
|
| 35 |
|
| 36 |
+
The agent learns to produce patches that not only compile but also withstand re-execution of the original exploit vector. Every decision is accompanied by a structured `<thought>` block, providing a complete and machine-readable audit trail.
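The all-or-nothing gate in the diagram can be sketched in a few lines of Python. This is a minimal illustration rather than PatchHawk's actual implementation: `VerificationResult`, its field names, and `syntax_check` are hypothetical, and only the syntax stage is shown concretely (via the standard-library `ast` module); the unit-test and re-attack stages are assumed to report a boolean each.

```python
import ast
from dataclasses import dataclass


@dataclass
class VerificationResult:
    """Outcome of the three checks in the diagram (field names are hypothetical)."""

    syntax_ok: bool
    tests_ok: bool
    reattack_defeated: bool

    def failed_stages(self) -> list[str]:
        # Collect the names of every stage that did not pass.
        stages = {
            "syntax": self.syntax_ok,
            "unit_tests": self.tests_ok,
            "re_attack": self.reattack_defeated,
        }
        return [name for name, ok in stages.items() if not ok]

    @property
    def all_pass(self) -> bool:
        # The positive branch is taken only when no stage failed.
        return not self.failed_stages()


def syntax_check(patched_source: str) -> bool:
    """Return True if the patched Python source parses without a SyntaxError."""
    try:
        ast.parse(patched_source)
    except SyntaxError:
        return False
    return True
```

Keeping the failed stage names around, instead of a single boolean, makes it easier to log why a patch was rejected.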
|
| 37 |
|
| 38 |
---
|
| 39 |
|
| 40 |
## ✨ Key Features
|
| 41 |
|
| 42 |
+
- 🛡️ **Autonomous Detection**: Sophisticated supply-chain analysis identifying typosquatting, backdoors, data exfiltration, and malicious logic in dependencies.
|
| 43 |
+
- 🐳 **Hardened Sandboxing**: High-fidelity Docker isolation with network-disabled execution, strict resource caps, and ephemeral file systems to safely detonate suspicious code.
|
| 44 |
+
- 🧠 **GRPO-Driven Learning**: Group Relative Policy Optimization (inspired by DeepSeek-R1) enables trial-and-error mastery and structured reasoning without a separate critic model.
|
| 45 |
+
- 🧩 **XML Reasoning Traces**: All agent decisions are accompanied by a machine-readable `<thought>...</thought>` block, providing full auditability of the decision-making process.
|
| 46 |
+
- **SOC Dashboard**: Real-time Streamlit interface for monitoring agent behavior, sandbox telemetry, and reward breakdowns.
|
| 47 |
+
- ✅ **OpenEnv Compliance**: Fully integrated with the PyTorch OpenEnv framework, ensuring reproducible and shareable reinforcement learning environments.
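To make the "no separate critic model" point concrete: GRPO scores each sampled completion against the other completions drawn for the same prompt, so the group itself serves as the baseline. The sketch below shows that normalisation step only. The function name and the `eps` parameter are illustrative, and the use of the population standard deviation is an assumption (implementations differ on this detail); it is not taken from `train_grpo.py`.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise each reward against its own sampling group.

    advantage_i = (r_i - mean(group)) / (std(group) + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some trainers use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two verified patches, two failed ones.
advantages = group_relative_advantages([3.0, -1.5, 3.0, -1.5])
```

When every completion in a group earns the same reward, all advantages are zero and that group contributes no gradient signal, which is why reward variance within a group is worth monitoring during training.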
|
| 48 |
|
| 49 |
---
|
| 50 |
|
| 51 |
## 🛠️ Project Structure
|
| 52 |
|
| 53 |
+
```text
|
| 54 |
PatchHawk/
|
|
|
|
| 55 |
├── src/envs/patchhawk/ # 📦 OpenEnv Submission Package
|
| 56 |
│   ├── server/ # FastAPI environment server
|
| 57 |
+
│   ├── models.py # Type-safe contract definitions
|
| 58 |
│   ├── client.py # Environment interaction client
|
| 59 |
│   └── inference.py # Main agent execution loop
|
| 60 |
├── patchhawk/ # 🧠 Core Logic & Training
|
|
|
|
| 62 |
│   ├── training/ # GRPO / Unsloth training scripts
|
| 63 |
│   └── app/ # Streamlit SOC Dashboard
|
| 64 |
├── docker/ # 🐳 Container configurations
|
| 65 |
+
├── assets/ # 🖼️ Training plots & Media
|
| 66 |
├── config.yaml # Environment & Agent configuration
|
| 67 |
├── openenv.yaml # OpenEnv metadata
|
| 68 |
├── .env.example # Environment variable template
|
| 69 |
└── README.md
|
| 70 |
```
|
| 71 |
|
| 72 |
---
|
| 73 |
|
| 74 |
+
## Getting Started
|
| 75 |
|
| 76 |
### Prerequisites
|
| 77 |
|
| 78 |
- Python 3.12 or higher
|
| 79 |
+
- Docker Engine (with buildx support)
|
| 80 |
+
- NVIDIA GPU (8 GB VRAM or more recommended for training and inference)
|
| 81 |
- Hugging Face account and access token
|
|
|
|
| 82 |
|
| 83 |
### Installation
|
| 84 |
|
|
|
|
|
|
|
| 85 |
```bash
|
| 86 |
+
# Clone the repository
|
| 87 |
git clone https://github.com/ramprasathk07/PatchHawk.git
|
| 88 |
cd PatchHawk
|
| 89 |
|
|
|
|
| 90 |
# Create and activate a virtual environment
|
| 91 |
python -m venv .venv
|
| 92 |
source .venv/bin/activate # On Windows: .venv\Scripts\activate
|
| 93 |
|
| 94 |
# Install core dependencies
|
| 95 |
pip install -e .
|
| 96 |
```
|
| 97 |
|
| 98 |
### Environment Setup
|
| 99 |
|
| 100 |
```bash
|
|
|
|
| 101 |
# Copy the environment template and populate your keys
|
| 102 |
cp .env.example .env
|
| 103 |
+
# Edit .env to include HF_TOKEN, OPENAI_API_KEY, WANDB_API_KEY, etc.
|
| 104 |
|
| 105 |
# Build the validation sandbox Docker image
|
| 106 |
docker build -t patchhawk-sandbox:latest -f docker/Dockerfile.sandbox .
|
| 107 |
```
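Once the image is built, a locked-down run along the lines of the "Hardened Sandboxing" feature might look like the sketch below. It only assembles a `docker run` argument list and does not execute anything. The flags are standard Docker CLI options, but the resource limits, the `/sandbox/target.py` mount point, and the assumption that the image has `python` on its `PATH` are illustrative guesses, not values taken from `docker/Dockerfile.sandbox`.

```python
import shlex


def sandbox_command(
    script_path: str,
    image: str = "patchhawk-sandbox:latest",
    memory: str = "256m",   # hypothetical cap
    cpus: str = "0.5",      # hypothetical cap
    pids_limit: int = 64,   # hypothetical cap
) -> list[str]:
    """Build a `docker run` argv for a network-disabled, resource-capped run."""
    return [
        "docker", "run",
        "--rm",                      # discard the container afterwards (ephemeral)
        "--network", "none",         # no network access
        "--memory", memory,          # memory cap
        "--cpus", cpus,              # CPU cap
        "--pids-limit", str(pids_limit),
        "--read-only",               # read-only root file system
        "--tmpfs", "/tmp",           # scratch space that vanishes with the container
        "-v", f"{script_path}:/sandbox/target.py:ro",  # assumed mount point
        image,
        "python", "/sandbox/target.py",  # assumes python is available in the image
    ]


print(shlex.join(sandbox_command("/abs/path/to/suspicious.py")))
```

Returning a list instead of a shell string means the command can be passed to `subprocess.run` without shell quoting concerns.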
|
| 108 |
|
| 109 |
### Running the Agent
|
| 110 |
|
|
|
|
|
|
|
| 111 |
```bash
|
| 112 |
# Terminal 1: environment server
|
| 113 |
python -m server.app --port 8000
|
| 114 |
|
| 115 |
# Terminal 2: inference loop
|
|
|
|
| 116 |
python src/envs/patchhawk/inference.py --env-url http://localhost:8000
|
| 117 |
```
|
| 118 |
|
| 119 |
---
|
| 120 |
|
| 121 |
+
## Training
|
| 122 |
|
| 123 |
PatchHawk uses GRPO with a 4-bit quantised Qwen2.5-Coder-7B-Instruct base model and LoRA adapters. The training script is located at `patchhawk/training/train_grpo.py`.
|
| 124 |
|
| 125 |
+
### Training Progress
|
| 126 |
|
| 127 |
+
| GRPO Training Reward | GRPO Group Reward Variance |
|
| 128 |
+
| :--- | :--- |
|
| 129 |
+
|  |  |
|
| 130 |
|
| 131 |
**GPU training (RTX 3060 12 GB defaults)**
|
| 132 |
|
|
|
|
| 140 |
--output-dir grpo_lora
|
| 141 |
```
|
| 142 |
|
| 143 |
---
|
| 144 |
|
| 145 |
+
## Reward Rubric
|
| 146 |
|
| 147 |
+
The agent is guided by a granular reward structure that encourages safe, effective, and verifiable actions.
|
| 148 |
|
| 149 |
+
| Action ID | Action Name | Base Reward | Success Criteria |
|
| 150 |
+
| :--- | :--- | :--- | :--- |
|
| 151 |
+
| **0** | `ANALYZE` | `0.0` | Observation step; used solely for data gathering. |
|
| 152 |
+
| **1** | `DETONATE` | `+0.1` | Successfully extract telemetry from the Docker sandbox. |
|
| 153 |
+
| **2** | `BLOCK_PR` | `+2.0 / -1.0` | Positive reward when correctly blocking a malicious PR; negative penalty for false positives. |
|
| 154 |
+
| **3** | `SUBMIT_PATCH` | `+3.0 / -1.5` | The primary goal. Reward requires passing syntax check, unit tests, and a re-attack validation. |
|
| 155 |
+
| **4** | `ESCALATE` | `0.0` | Hands off to a human expert when uncertainty exceeds a configurable threshold. |
|
| 156 |
|
| 157 |
+
### Dynamic Scaling Factors
|
| 158 |
+
- **Risk Accuracy Bonus**: Up to `+2.0` additional reward for accurately predicting the risk score of a vulnerability.
|
| 159 |
+
- **Safety Multiplier**: Repeated syntax check failures apply a decay factor to all future rewards.
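One way to read the rubric and its scaling factors together is the sketch below. The base rewards are copied from the table; everything else is an assumption made for illustration. The README does not specify the shape of the risk bonus (taken here as linear in the prediction error), the decay factor (taken here as `0.9` per prior syntax failure), or the reward for a failed `DETONATE` (taken here as `0.0`). None of the names come from the PatchHawk codebase.

```python
from enum import IntEnum
from typing import Optional


class Action(IntEnum):
    ANALYZE = 0
    DETONATE = 1
    BLOCK_PR = 2
    SUBMIT_PATCH = 3
    ESCALATE = 4


# (reward on success, reward on failure), copied from the rubric table.
# The 0.0 failure value for DETONATE is an assumption; the table gives none.
BASE_REWARD = {
    Action.ANALYZE: (0.0, 0.0),
    Action.DETONATE: (0.1, 0.0),
    Action.BLOCK_PR: (2.0, -1.0),
    Action.SUBMIT_PATCH: (3.0, -1.5),
    Action.ESCALATE: (0.0, 0.0),
}


def step_reward(
    action: Action,
    success: bool,
    risk_error: Optional[float] = None,
    syntax_failures: int = 0,
    decay: float = 0.9,
) -> float:
    """Combine base reward, risk bonus, and safety multiplier (all shapes assumed)."""
    on_success, on_failure = BASE_REWARD[action]
    reward = on_success if success else on_failure
    if risk_error is not None:
        # Assumed linear bonus: +2.0 for an exact risk score, 0 once the error reaches 1.
        reward += 2.0 * max(0.0, 1.0 - abs(risk_error))
    # Assumed geometric decay applied once per earlier syntax-check failure.
    return reward * decay ** syntax_failures
```

Note that under this reading the decay also shrinks penalties; whether the real safety multiplier behaves that way is not stated in the README.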
|
| 160 |
|
| 161 |
---
|
| 162 |
|
| 163 |
+
## Dashboard
|
| 164 |
|
| 165 |
+
Launch the **Security Operations Center (SOC)** dashboard to observe the agent's reasoning in real time.
|
|
|
|
| 166 |
|
| 167 |
```bash
|
| 168 |
streamlit run patchhawk/app/dashboard.py
|
| 169 |
```
|
| 170 |
|
|
|
|
| 171 |
The dashboard provides:
|
| 172 |
+
- Live XML reasoning logs (`<thought>` traces) from the agent.
|
| 173 |
+
- Real-time stdout/stderr streams from the Docker sandbox.
|
| 174 |
+
- Detailed audit trail of reward assignments and verification outcomes.
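For readers who want to consume the reasoning logs outside the dashboard, pulling the `<thought>` traces out of raw model output can be as simple as the sketch below. It assumes the blocks are not nested and that the tag is spelled exactly `<thought>`; the helper name is hypothetical and not part of `dashboard.py`.

```python
import re

# Non-greedy match so that several blocks in one output are captured separately;
# DOTALL lets a thought span multiple lines.
THOUGHT_RE = re.compile(r"<thought>(.*?)</thought>", re.DOTALL)


def extract_thoughts(model_output: str) -> list[str]:
    """Return the text of every <thought>...</thought> block, in order."""
    return [match.strip() for match in THOUGHT_RE.findall(model_output)]


sample = "<thought>setup.py opens a socket at import time.</thought>ACTION: DETONATE"
thoughts = extract_thoughts(sample)
```

A regular expression is enough here because the block is a flat delimiter; a full XML parser would only be needed if the traces carried nested structure.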
|
| 175 |
|
| 176 |
---
|
| 177 |
|
| 178 |
## 🗺️ Roadmap & Future Work
|
| 179 |
|
| 180 |
+
- [ ] **Multi-Agent Coordination**: Deploy attacker and defender models for automated red-teaming exercises.
|
| 181 |
+
- [ ] **CVE Ingestion**: Automatically generate training scenarios from the National Vulnerability Database (NVD).
|
| 182 |
+
- [ ] **Cross-Language Support**: Expand beyond Python to Go, JavaScript, Rust, and Java.
|
| 183 |
+
- [ ] **Kubernetes Native**: Orchestrate sandboxes at scale using Kubernetes instead of local Docker.
|
| 184 |
+
- [ ] **Fine-Tuned Vulnerability Model**: Train a specialized 7B parameter LLM (e.g., VulnLLM-R) on vulnerability-fixing commits.
|
| 185 |
+
- [ ] **Context-Aware Analysis**: Integrate Code Property Graph (CPG) slicing for LLM-based semantic vulnerability detection.
|
| 186 |
+
- [ ] **Silent Patch Detection**: Identify security-relevant commits that were not publicly disclosed.
|
| 187 |
+
- [ ] **AI-Generated Code Audit**: Trace vulnerabilities back to AI coding assistants (e.g., GitHub Copilot, ChatGPT).
|
| 188 |
+
- [ ] **Automated PR Remediation**: Generate and submit fix-containing pull requests for detected vulnerabilities.
|
| 189 |
+
- [ ] **Adversarial Training Loop**: Implement a self-improving LLM-vs-LLM red-team / blue-team training regimen.
|
| 190 |
+
- [ ] **Supply-Chain Malware Detection**: Extend dependency analysis to identify novel, unpublished attack patterns.
|
| 191 |
|
| 192 |
---
|
| 193 |
|
| 194 |
+
## License
|
| 195 |
|
|
|
|
| 196 |
Distributed under the **MIT License**. See the LICENSE file in the repository root for full details.
|
| 197 |
|
| 198 |
Developed with ❤️ by **Ramprasath K & The PatchHawk Team** for the OpenEnv Hackathon 2026 hosted by Meta.
|
assets/grpo1.png
ADDED
|
assets/grpo2.png
ADDED
|