Spaces:

rithwik-ravikumar
/

OpenEnv-Dynamic-Guardrails

Sleeping

File size: 6,777 Bytes

084f95a
 
 
 
 
 
 
 
 
458c5ca
084f95a
cffa613
62e6fee
cffa613
62e6fee
4ac95cc
62e6fee
 
 
9541ba6
 
 
 
 
 
 
 
 
 
 
 
cffa613
 
 
9541ba6
cffa613
4f34790
9541ba6
084f95a
9541ba6
084f95a
9541ba6
 
cffa613
084f95a
cffa613
084f95a
cffa613
80b34d1
 
 
 
9541ba6
 
458c5ca
 
 
9541ba6
084f95a
cffa613
084f95a
 
 
 
cffa613
458c5ca
cffa613
084f95a
cffa613
9541ba6
cffa613
9541ba6
084f95a
cffa613
084f95a
 
9541ba6
 
cffa613
 
 
9541ba6
cffa613
084f95a
cffa613
084f95a
 
 
 
 
 
 
 
 
 
 
 
 
 
cffa613
084f95a
4f34790
cffa613
9541ba6
 
 
458c5ca
084f95a
9541ba6
4f34790

---
title: Dynamic Guardrail Generator
emoji: 🛡️
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---
# Dynamic Guardrail Generator
**Team Winnovators (Rithwik & Parveshh)**

<div align="center">

[![Hugging Face Space](https://img.shields.io/badge/🤗_Live_Demo-Hugging_Face-FFD21E?style=for-the-badge)](https://huggingface.co/spaces/rithwik-ravikumar/OpenEnv-Dynamic-Guardrails)
[![YouTube Pitch](https://img.shields.io/badge/▶️_Pitch_Video-YouTube-FF0000?style=for-the-badge)](https://youtu.be/Ae9oubiVh4E)
[![Google Colab](https://img.shields.io/badge/🪐_Training_PoC-Google_Colab-F9AB00?style=for-the-badge)](https://colab.research.google.com/drive/1LIGdmIs4sFQ21-e5Bm7Kz3noYmMpFEkd?usp=sharing)

</div>
---

## 🛑 The Problem Space

Enterprise AI adoption is soaring at 78%, yet **95% of GenAI pilots fail to reach production readiness** due to critical security and compliance roadblocks. With the average cost of a data breach hitting **$4.44 million**, deploying unprotected LLMs is an existential business risk.

The apex predator of these threats is **OWASP Top 10 LLM01: Prompt Injection**. 

Current industry solutions are fatally flawed:
- **Static Regex/Heuristics:** Semantically blind and trivially bypassed by modern adversarial jailbreaks.
- **"LLM-as-a-Judge" Architectures:** Introduce massive >500ms latency bottlenecks per inference and ruinous compute overhead, destroying user experience.
- **The "Alignment Tax" & Refusal Collapse:** When guardrail models are trained via standard supervised safety tuning, they treat security as a binary token. This leads to *Refusal Collapse*—the system becomes so paranoid that it suffers a **41%+ false positive drop rate**, blocking perfectly benign user traffic and obliterating the product's core utility.

---

## 💡 Our Solution: The OpenEnv Compiler Architecture

Aligning with **Theme #4: Self-Improvement**, we solved this by separating the intelligence from the execution. 

The **Dynamic Guardrail Generator** treats the LLM as an autonomous Blue-Team engineer. Running inside our strict `OpenEnv` grading environment, the agent does not evaluate prompts directly. Instead, it synthesizes a highly constrained, Pydantic-validated **JSON Guardrail Logic Graph** (a Domain Specific Language). 

By forcing the agent to map threats to a structured AST using strict `LogicNodes` (`AND`, `OR`, `NOT`) and `SemanticFilters` (such as `entropy_threshold`, `length_limit`, `regex_pattern`, and `keyword_match`), we entirely bypass brittle spaghetti-code generation, eliminate runtime hallucinations, and execute the defense with zero-latency deterministic logic.

---

## ⚙️ Reward Engineering & Pipeline

To train our autonomous compiler, we built a High-Fidelity RLVR (Reinforcement Learning with Verifiable Rewards) pipeline.

### 🏗️ Execution & Reward Sandboxing
- **Real-Time AST Validation:** Every generated graph is strictly validated via Pydantic (`min_length=1` constraints on all logic trees) to guarantee structural integrity and prevent zero-cliff bypasses.
- **Micro-Sandboxing (ReDoS Guard):** The AST evaluation engine runs within an isolated environment utilizing strict `50ms` execution timeouts and recursive depth limiters to neutralize Regular Expression Denial of Service (ReDoS) from LLM-generated patterns.

### The Log-Barrier Multi-Objective Reward
To mathematically eradicate "Refusal Collapse", we designed a rigorous deterministic reward surface:
```python
Reward = (1.0 * Recall) - (2.0 * math.log1p(FPR))
```
- **Recall (True Positive Rate):** A linear reward for successfully neutralizing adversarial payloads.
- **FPR (False Positive Rate):** A severe non-linear logarithmic penalty for blocking benign user queries, mathematically forcing the agent to preserve application utility.

### Dual-Compute Strategy
We utilized **Unsloth (4-bit quantization)** and **Hugging Face TRL (GRPO)** on `Qwen/Qwen2.5-0.5B-Instruct` to keep the memory footprint under 8GB VRAM. 
- **Cloud Proof of Concept:** We provided a verifiable Google Colab notebook running on a T4 GPU as a 4-step proof of learning.
- **Local High-Fidelity Training:** Our actual production LoRA adapter was trained locally for 250 steps on a dedicated **RTX 4070 GPU** to achieve high-fidelity semantic parsing and complex graph synthesis.

---

## 📈 Results & UI Dashboard

Our training resulted in an agent capable of generating highly targeted logic graphs that dynamically adapt to new threat vectors.

![Training Reward Curve](reward_curve.png)
*Figure 1: GRPO Training Curve demonstrating the agent escaping refusal-collapse.*

### Decoupled Telemetry & Live A/B Comparison
We built a rich, non-blocking telemetry dashboard (`FastAPI` + Server-Sent Events) that streams live metrics without impacting the execution time of the strict OpenEnv evaluation loop.

Our UI features a **Live A/B Performance Delta** capability. The `evaluate.py` inference script runs dual-passes—temporarily disabling the trained LoRA adapter via `model.disable_adapter()` to evaluate the base Qwen2.5 weights against our RL-trained agent in real-time. The dashboard plots the diverging trajectories of both the Reward metrics and the FPR, alongside a live Threat Feed and JSON AST Viewer.

---

## 💻 Local Run Instructions

We have battle-tested this environment specifically for Windows local deployments.

### 1. Windows GPU Setup (Critical Fixes)
To bypass known PyTorch and Triton compiler conflicts on Windows, you must configure your environment exactly as follows:

1. **Python Version:** Create a virtual environment using **Python 3.13** (Avoid Python 3.14 to maintain dependency compatibility).
2. **Install PyTorch 2.11 (CUDA 12.6):** Standard `requirements.txt` installs will pull CPU wheels. You must install PyTorch from the `cu126` index:
   ```bash
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126 --upgrade
   ```
3. **Install Dependencies & Triton Compiler:**
   ```bash
   pip install -r requirements.txt
   pip install triton-windows
   ```
*(Note: If Triton throws a `Python.h` missing error, create a directory junction linking your base Python `include` folder to your project root `Include` folder).*

### 2. Run the Master Orchestrator
We have bundled a master orchestrator (`run_all.py`) that automatically cleans up zombie ports, boots the merged Core API & Telemetry UI Server (Port 8000) into the background, and triggers the Headless OpenEnv Evaluator (`evaluate.py`).

```bash
python run_all.py
```

### 3. View the Dashboard
Once the orchestrator initializes, open your browser to:
[http://127.0.0.1:8000](http://127.0.0.1:8000) to watch the live A/B comparison and Threat Feed stream in real-time.