Spaces:

rithwik-ravikumar
/

OpenEnv-Dynamic-Guardrails

Sleeping

App Files Files Community

OpenEnv-Dynamic-Guardrails / README.md

Rithwik Ravi

Final submission: Finalized README with artifacts and video link

4ac95cc 24 days ago

preview code

raw

history blame contribute delete

6.78 kB

	---
	title: Dynamic Guardrail Generator
	emoji: 🛡️
	colorFrom: blue
	colorTo: purple
	sdk: docker
	pinned: false
	license: mit
	---
	# Dynamic Guardrail Generator
	Team Winnovators (Rithwik & Parveshh)

	<div align="center">

	[![Hugging Face Space](https://img.shields.io/badge/🤗_Live_Demo-Hugging_Face-FFD21E?style=for-the-badge)](https://huggingface.co/spaces/rithwik-ravikumar/OpenEnv-Dynamic-Guardrails)
	[![YouTube Pitch](https://img.shields.io/badge/▶️_Pitch_Video-YouTube-FF0000?style=for-the-badge)](https://youtu.be/Ae9oubiVh4E)
	[![Google Colab](https://img.shields.io/badge/🪐_Training_PoC-Google_Colab-F9AB00?style=for-the-badge)](https://colab.research.google.com/drive/1LIGdmIs4sFQ21-e5Bm7Kz3noYmMpFEkd?usp=sharing)

	</div>
	---

	## 🛑 The Problem Space

	Enterprise AI adoption is soaring at 78%, yet 95% of GenAI pilots fail to reach production readiness due to critical security and compliance roadblocks. With the average cost of a data breach hitting $4.44 million, deploying unprotected LLMs is an existential business risk.

	The apex predator of these threats is OWASP Top 10 LLM01: Prompt Injection.

	Current industry solutions are fatally flawed:
	- Static Regex/Heuristics: Semantically blind and trivially bypassed by modern adversarial jailbreaks.
	- "LLM-as-a-Judge" Architectures: Introduce massive >500ms latency bottlenecks per inference and ruinous compute overhead, destroying user experience.
	- The "Alignment Tax" & Refusal Collapse: When guardrail models are trained via standard supervised safety tuning, they treat security as a binary token. This leads to Refusal Collapse—the system becomes so paranoid that it suffers a 41%+ false positive drop rate, blocking perfectly benign user traffic and obliterating the product's core utility.

	---

	## 💡 Our Solution: The OpenEnv Compiler Architecture

	Aligning with Theme #4: Self-Improvement, we solved this by separating the intelligence from the execution.

	The Dynamic Guardrail Generator treats the LLM as an autonomous Blue-Team engineer. Running inside our strict `OpenEnv` grading environment, the agent does not evaluate prompts directly. Instead, it synthesizes a highly constrained, Pydantic-validated JSON Guardrail Logic Graph (a Domain Specific Language).

	By forcing the agent to map threats to a structured AST using strict `LogicNodes` (`AND`, `OR`, `NOT`) and `SemanticFilters` (such as `entropy_threshold`, `length_limit`, `regex_pattern`, and `keyword_match`), we entirely bypass brittle spaghetti-code generation, eliminate runtime hallucinations, and execute the defense with zero-latency deterministic logic.

	---

	## ⚙️ Reward Engineering & Pipeline

	To train our autonomous compiler, we built a High-Fidelity RLVR (Reinforcement Learning with Verifiable Rewards) pipeline.

	### 🏗️ Execution & Reward Sandboxing
	- Real-Time AST Validation: Every generated graph is strictly validated via Pydantic (`min_length=1` constraints on all logic trees) to guarantee structural integrity and prevent zero-cliff bypasses.
	- Micro-Sandboxing (ReDoS Guard): The AST evaluation engine runs within an isolated environment utilizing strict `50ms` execution timeouts and recursive depth limiters to neutralize Regular Expression Denial of Service (ReDoS) from LLM-generated patterns.

	### The Log-Barrier Multi-Objective Reward
	To mathematically eradicate "Refusal Collapse", we designed a rigorous deterministic reward surface:
	```python
	Reward = (1.0 * Recall) - (2.0 * math.log1p(FPR))
	```
	- Recall (True Positive Rate): A linear reward for successfully neutralizing adversarial payloads.
	- FPR (False Positive Rate): A severe non-linear logarithmic penalty for blocking benign user queries, mathematically forcing the agent to preserve application utility.

	### Dual-Compute Strategy
	We utilized Unsloth (4-bit quantization) and Hugging Face TRL (GRPO) on `Qwen/Qwen2.5-0.5B-Instruct` to keep the memory footprint under 8GB VRAM.
	- Cloud Proof of Concept: We provided a verifiable Google Colab notebook running on a T4 GPU as a 4-step proof of learning.
	- Local High-Fidelity Training: Our actual production LoRA adapter was trained locally for 250 steps on a dedicated RTX 4070 GPU to achieve high-fidelity semantic parsing and complex graph synthesis.

	---

	## 📈 Results & UI Dashboard

	Our training resulted in an agent capable of generating highly targeted logic graphs that dynamically adapt to new threat vectors.

	![Training Reward Curve](reward_curve.png)
	Figure 1: GRPO Training Curve demonstrating the agent escaping refusal-collapse.

	### Decoupled Telemetry & Live A/B Comparison
	We built a rich, non-blocking telemetry dashboard (`FastAPI` + Server-Sent Events) that streams live metrics without impacting the execution time of the strict OpenEnv evaluation loop.

	Our UI features a Live A/B Performance Delta capability. The `evaluate.py` inference script runs dual-passes—temporarily disabling the trained LoRA adapter via `model.disable_adapter()` to evaluate the base Qwen2.5 weights against our RL-trained agent in real-time. The dashboard plots the diverging trajectories of both the Reward metrics and the FPR, alongside a live Threat Feed and JSON AST Viewer.

	---

	## 💻 Local Run Instructions

	We have battle-tested this environment specifically for Windows local deployments.

	### 1. Windows GPU Setup (Critical Fixes)
	To bypass known PyTorch and Triton compiler conflicts on Windows, you must configure your environment exactly as follows:

	1. Python Version: Create a virtual environment using Python 3.13 (Avoid Python 3.14 to maintain dependency compatibility).
	2. Install PyTorch 2.11 (CUDA 12.6): Standard `requirements.txt` installs will pull CPU wheels. You must install PyTorch from the `cu126` index:
	```bash
	pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126 --upgrade
	```
	3. Install Dependencies & Triton Compiler:
	```bash
	pip install -r requirements.txt
	pip install triton-windows
	```
	(Note: If Triton throws a `Python.h` missing error, create a directory junction linking your base Python `include` folder to your project root `Include` folder).

	### 2. Run the Master Orchestrator
	We have bundled a master orchestrator (`run_all.py`) that automatically cleans up zombie ports, boots the merged Core API & Telemetry UI Server (Port 8000) into the background, and triggers the Headless OpenEnv Evaluator (`evaluate.py`).

	```bash
	python run_all.py
	```

	### 3. View the Dashboard
	Once the orchestrator initializes, open your browser to:
	[http://127.0.0.1:8000](http://127.0.0.1:8000) to watch the live A/B comparison and Threat Feed stream in real-time.