---
language:
- en
- vi
tags:
- safety
- guardrail
- routing
- pytorch
- tabular-classification
metrics:
- f1
- accuracy
- precision
- recall
---
# SafeRoute Router Model (DynaGuard 1.7B / 8B)
This repository contains the weights for the **SafeRoute Router**, a small neural router that decides, for each input prompt/response, whether a lightweight safety classifier (Small Model) is sufficient or a high-capacity safety classifier (Large Model) should handle it.
By keeping "easy/safe" queries on the small model and reserving the large model for "hard/unsafe" queries, the system is designed to reduce inference latency and computational cost while preserving overall safety evaluation performance.
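The routing flow described above can be sketched as a two-stage cascade. This is a minimal illustration under stated assumptions, not the project's serving code: `cascade`, `small_model`, and `large_model` are hypothetical names, the small model is assumed to return both features and a score per input, and the stubs return placeholder tensors instead of real classifier outputs.

```python
import torch

def cascade(texts, small_model, router, large_model, threshold=0.6):
    """Score every text with the small model; re-score flagged ones with the large model."""
    # Hypothetical interface: small_model returns (features, scores) for the batch.
    features, small_scores = small_model(texts)
    with torch.no_grad():
        route_probs = torch.sigmoid(router(features))
    escalate = (route_probs > threshold).tolist()

    # Only the flagged texts are sent to the (expensive) large model.
    hard_texts = [t for t, e in zip(texts, escalate) if e]
    large_scores = iter(large_model(hard_texts).tolist() if hard_texts else [])

    final = [next(large_scores) if e else s
             for s, e in zip(small_scores.tolist(), escalate)]
    return final, escalate

# Stubs standing in for real models (assumptions for illustration only).
def stub_small(texts):
    return torch.randn(len(texts), 2048), torch.zeros(len(texts))

def stub_large(texts):
    return torch.ones(len(texts))

def always_escalate(features):
    return torch.full((features.shape[0],), 10.0)

def never_escalate(features):
    return torch.full((features.shape[0],), -10.0)

texts = ["a", "b", "c"]
print(cascade(texts, stub_small, always_escalate, stub_large))
print(cascade(texts, stub_small, never_escalate, stub_large))
```

The fraction of `True` entries in `escalate` is the share of traffic that pays the large-model cost, which is the quantity the router is meant to keep low.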
## Model Details
- **Architecture:** Multi-Layer Perceptron (MLP) with 3 hidden layers (`1024 -> 512 -> 256`), each followed by `BatchNorm1d`, a `GELU` activation, and `Dropout` (0.3 after the first two hidden layers, 0.2 after the third).
- **Input Dimension:** `2048` (feature embeddings extracted from the small safety model).
- **Output Dimension:** `1` (a single binary classification logit; applying a sigmoid gives the probability of routing to the Large Model).
- **Loss Function:** `Focal Loss` ($\alpha=0.75, \gamma=2.0$) tailored to address severe class imbalance.
- **Optimizer & Scheduler:** `AdamW` with `CosineAnnealingWarmRestarts`.
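The training setup listed above can be sketched as follows. This is an illustrative reconstruction, not the original training script: `binary_focal_loss` is a hypothetical helper, $\alpha$ is assumed to weight the positive ("route to Large Model") class, and the learning rate, weight decay, and `T_0` values are placeholders that the card does not specify.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    """Mean binary focal loss on raw logits; targets are 0/1 floats."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # Probability the model assigns to the true class of each sample.
    p_t = p * targets + (1 - p) * (1 - targets)
    # Assumption: alpha weights positives, (1 - alpha) weights negatives.
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma down-weights samples that are already classified well.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

model = nn.Linear(2048, 1)  # stand-in for the router MLP
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)  # assumed values
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)  # assumed T_0

# One training step on random data, to show how the pieces fit together.
x = torch.randn(8, 2048)
y = torch.tensor([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
loss = binary_focal_loss(model(x).squeeze(-1), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
```

With `gamma=0` and `alpha=0.5` the function reduces to half of the ordinary binary cross-entropy, which is a convenient sanity check.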
## Evaluation Results
Evaluated on a balanced test benchmark. F1, accuracy, precision, and recall are reported at the selected decision threshold (**0.6**); AUPRC is threshold-independent:
| Metric | Score |
| :--- | :---: |
| **F1 Score** | **0.7525** |
| **Accuracy** | **0.7500** |
| **Precision** | **0.7451** |
| **Recall** | **0.7600** |
| **Overall AUPRC** | **0.7588** |
*Note: At this threshold the router sends about 76% of the unsafe or ambiguous prompts to the Large Model (recall 0.76), and about 75% of the prompts it escalates actually need that inspection (precision 0.75). Roughly a quarter of such prompts still stay on the Small Model.*
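To make the relationship between the reported metrics concrete, the sketch below computes them from confusion-matrix counts. The counts are hypothetical: the card does not state the test-set size, and these values (100 positives, 100 negatives) were chosen only because they reproduce the table at four decimal places.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# Hypothetical balanced test set: positive = "should go to the Large Model".
m = classification_metrics(tp=76, fp=26, fn=24, tn=74)
print({name: round(value, 4) for name, value in m.items()})
# {'precision': 0.7451, 'recall': 0.76, 'accuracy': 0.75, 'f1': 0.7525}
```

Here `fn=24` corresponds to hard inputs left on the Small Model, and `fp=26` to easy inputs escalated unnecessarily.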
## How to Get Started with the Model
You can easily download and use this model in your PyTorch pipeline:
```python
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download


# 1. Define the Router Architecture
class RouterMLP(nn.Module):
    def __init__(self, input_dim=2048):
        super().__init__()
        self.cls = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.BatchNorm1d(1024),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.GELU(),
            nn.Dropout(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, x):
        return self.cls(x).squeeze(-1)


# 2. Download and Load the Checkpoint
repo_id = "YOUR_HF_USERNAME/safe-route-dynaguard"  # <-- Replace with your repo name
model_path = hf_hub_download(repo_id=repo_id, filename="model.pt")
device = "cuda" if torch.cuda.is_available() else "cpu"
router = RouterMLP(input_dim=2048).to(device)
ckpt = torch.load(model_path, map_location=device)
router.load_state_dict(ckpt["state_dict"], strict=False)
router.eval()

# 3. Perform Routing Inference
with torch.no_grad():
    # Example feature tensor extracted from small model
    sample_features = torch.randn(4, 2048, device=device)
    logits = router(sample_features)
    routing_probs = torch.sigmoid(logits)
    # Use recommended threshold 0.6
    decisions = (routing_probs > 0.6).long()

for i, decision in enumerate(decisions):
    if decision == 1:
        print(f"Sample {i}: Route to LARGE Model (Hard/Unsafe)")
    else:
        print(f"Sample {i}: Use SMALL Model (Easy/Safe)")
```
## Intended Use
- **Primary Use Case:** Guardrail optimization in LLM serving pipelines.
- **Out-of-Scope:** Standalone toxicity classification directly from raw text (this model requires intermediate hidden feature representations from a pre-trained small safety model).
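Because the router consumes fixed-size feature vectors rather than text, a feature-extraction step has to run before it. This card does not say how the 2048-dimensional features were obtained from the small model, so the following is only one plausible sketch: masked mean pooling over a `(batch, seq_len, hidden)` hidden-state tensor. `masked_mean_pool` is a hypothetical helper, and the random tensors stand in for real model outputs.

```python
import torch

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (B, T, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # (B, H)
    counts = mask.sum(dim=1).clamp(min=1.0)                      # avoid division by zero
    return summed / counts

# Placeholder inputs: 2 sequences, 5 tokens each, hidden size 2048.
hidden = torch.randn(2, 5, 2048)
mask = torch.tensor([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 1]])
features = masked_mean_pool(hidden, mask)
print(features.shape)  # torch.Size([2, 2048])
```

Whatever pooling is used at inference must match the extraction used when the router was trained; otherwise the routing probabilities are not meaningful.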