File size: 6,777 Bytes
084f95a 458c5ca 084f95a cffa613 62e6fee cffa613 62e6fee 4ac95cc 62e6fee 9541ba6 cffa613 9541ba6 cffa613 4f34790 9541ba6 084f95a 9541ba6 084f95a 9541ba6 cffa613 084f95a cffa613 084f95a cffa613 80b34d1 9541ba6 458c5ca 9541ba6 084f95a cffa613 084f95a cffa613 458c5ca cffa613 084f95a cffa613 9541ba6 cffa613 9541ba6 084f95a cffa613 084f95a 9541ba6 cffa613 9541ba6 cffa613 084f95a cffa613 084f95a cffa613 084f95a 4f34790 cffa613 9541ba6 458c5ca 084f95a 9541ba6 4f34790 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 | ---
title: Dynamic Guardrail Generator
emoji: 🛡️
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---
# Dynamic Guardrail Generator
**Team Winnovators (Rithwik & Parveshh)**
<div align="center">
[](https://huggingface.co/spaces/rithwik-ravikumar/OpenEnv-Dynamic-Guardrails)
[](https://youtu.be/Ae9oubiVh4E)
[](https://colab.research.google.com/drive/1LIGdmIs4sFQ21-e5Bm7Kz3noYmMpFEkd?usp=sharing)
</div>
---
## 🛑 The Problem Space
Enterprise AI adoption is soaring at 78%, yet **95% of GenAI pilots fail to reach production readiness** due to critical security and compliance roadblocks. With the average cost of a data breach hitting **$4.44 million**, deploying unprotected LLMs is an existential business risk.
The apex predator of these threats is **OWASP Top 10 LLM01: Prompt Injection**.
Current industry solutions are fatally flawed:
- **Static Regex/Heuristics:** Semantically blind and trivially bypassed by modern adversarial jailbreaks.
- **"LLM-as-a-Judge" Architectures:** Introduce massive >500ms latency bottlenecks per inference and ruinous compute overhead, destroying user experience.
- **The "Alignment Tax" & Refusal Collapse:** When guardrail models are trained via standard supervised safety tuning, they treat security as a binary token. This leads to *Refusal Collapse*—the system becomes so paranoid that it suffers a **41%+ false positive drop rate**, blocking perfectly benign user traffic and obliterating the product's core utility.
---
## 💡 Our Solution: The OpenEnv Compiler Architecture
Aligning with **Theme #4: Self-Improvement**, we solved this by separating the intelligence from the execution.
The **Dynamic Guardrail Generator** treats the LLM as an autonomous Blue-Team engineer. Running inside our strict `OpenEnv` grading environment, the agent does not evaluate prompts directly. Instead, it synthesizes a highly constrained, Pydantic-validated **JSON Guardrail Logic Graph** (a Domain Specific Language).
By forcing the agent to map threats to a structured AST using strict `LogicNodes` (`AND`, `OR`, `NOT`) and `SemanticFilters` (such as `entropy_threshold`, `length_limit`, `regex_pattern`, and `keyword_match`), we entirely bypass brittle spaghetti-code generation, eliminate runtime hallucinations, and execute the defense with zero-latency deterministic logic.
---
## ⚙️ Reward Engineering & Pipeline
To train our autonomous compiler, we built a High-Fidelity RLVR (Reinforcement Learning with Verifiable Rewards) pipeline.
### 🏗️ Execution & Reward Sandboxing
- **Real-Time AST Validation:** Every generated graph is strictly validated via Pydantic (`min_length=1` constraints on all logic trees) to guarantee structural integrity and prevent zero-cliff bypasses.
- **Micro-Sandboxing (ReDoS Guard):** The AST evaluation engine runs within an isolated environment utilizing strict `50ms` execution timeouts and recursive depth limiters to neutralize Regular Expression Denial of Service (ReDoS) from LLM-generated patterns.
### The Log-Barrier Multi-Objective Reward
To mathematically eradicate "Refusal Collapse", we designed a rigorous deterministic reward surface:
```python
Reward = (1.0 * Recall) - (2.0 * math.log1p(FPR))
```
- **Recall (True Positive Rate):** A linear reward for successfully neutralizing adversarial payloads.
- **FPR (False Positive Rate):** A severe non-linear logarithmic penalty for blocking benign user queries, mathematically forcing the agent to preserve application utility.
### Dual-Compute Strategy
We utilized **Unsloth (4-bit quantization)** and **Hugging Face TRL (GRPO)** on `Qwen/Qwen2.5-0.5B-Instruct` to keep the memory footprint under 8GB VRAM.
- **Cloud Proof of Concept:** We provided a verifiable Google Colab notebook running on a T4 GPU as a 4-step proof of learning.
- **Local High-Fidelity Training:** Our actual production LoRA adapter was trained locally for 250 steps on a dedicated **RTX 4070 GPU** to achieve high-fidelity semantic parsing and complex graph synthesis.
---
## 📈 Results & UI Dashboard
Our training resulted in an agent capable of generating highly targeted logic graphs that dynamically adapt to new threat vectors.

*Figure 1: GRPO Training Curve demonstrating the agent escaping refusal-collapse.*
### Decoupled Telemetry & Live A/B Comparison
We built a rich, non-blocking telemetry dashboard (`FastAPI` + Server-Sent Events) that streams live metrics without impacting the execution time of the strict OpenEnv evaluation loop.
Our UI features a **Live A/B Performance Delta** capability. The `evaluate.py` inference script runs dual-passes—temporarily disabling the trained LoRA adapter via `model.disable_adapter()` to evaluate the base Qwen2.5 weights against our RL-trained agent in real-time. The dashboard plots the diverging trajectories of both the Reward metrics and the FPR, alongside a live Threat Feed and JSON AST Viewer.
---
## 💻 Local Run Instructions
We have battle-tested this environment specifically for Windows local deployments.
### 1. Windows GPU Setup (Critical Fixes)
To bypass known PyTorch and Triton compiler conflicts on Windows, you must configure your environment exactly as follows:
1. **Python Version:** Create a virtual environment using **Python 3.13** (Avoid Python 3.14 to maintain dependency compatibility).
2. **Install PyTorch 2.11 (CUDA 12.6):** Standard `requirements.txt` installs will pull CPU wheels. You must install PyTorch from the `cu126` index:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126 --upgrade
```
3. **Install Dependencies & Triton Compiler:**
```bash
pip install -r requirements.txt
pip install triton-windows
```
*(Note: If Triton throws a `Python.h` missing error, create a directory junction linking your base Python `include` folder to your project root `Include` folder).*
### 2. Run the Master Orchestrator
We have bundled a master orchestrator (`run_all.py`) that automatically cleans up zombie ports, boots the merged Core API & Telemetry UI Server (Port 8000) into the background, and triggers the Headless OpenEnv Evaluator (`evaluate.py`).
```bash
python run_all.py
```
### 3. View the Dashboard
Once the orchestrator initializes, open your browser to:
[http://127.0.0.1:8000](http://127.0.0.1:8000) to watch the live A/B comparison and Threat Feed stream in real-time.
|