---
language:
- en
- vi
tags:
- safety
- guardrail
- routing
- pytorch
- tabular-classification
metrics:
- f1
- accuracy
- precision
- recall
---

# SafeRoute Router Model (DynaGuard 1.7B / 8B)

This repository contains the weights for the **SafeRoute Router**, a small neural router that decides, for each input prompt/response, whether it is handled by a lightweight safety classifier (Small Model) or escalated to a high-capacity safety classifier (Large Model).

By keeping "easy/safe" queries on the small model and reserving the large model for "hard/unsafe" queries, the system reduces average inference latency and computational cost while preserving overall safety evaluation performance.
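
The routing flow described above can be sketched as a simple cascade. This is a minimal illustration, not the pipeline shipped with the model: `cascade_classify` and the three callables (`small_model`, `router`, `large_model`) are hypothetical names, and the toy stand-ins at the bottom only exist to make the example runnable.

```python
from typing import Callable, List, Sequence, Tuple


def cascade_classify(
    inputs: Sequence[str],
    small_model: Callable[[str], Tuple[int, List[float]]],
    router: Callable[[List[float]], float],
    large_model: Callable[[str], int],
    threshold: float = 0.6,
) -> Tuple[List[int], int]:
    """Return (safety labels, number of large-model calls)."""
    labels: List[int] = []
    large_calls = 0
    for text in inputs:
        # The small model always runs; it yields a verdict and the router's features.
        small_label, features = small_model(text)
        if router(features) > threshold:
            # Router considers the input hard: escalate to the large model.
            labels.append(large_model(text))
            large_calls += 1
        else:
            # Router considers the input easy: keep the small model's verdict.
            labels.append(small_label)
    return labels, large_calls


# Toy stand-ins (hypothetical), used only to exercise the control flow.
def toy_small(text: str) -> Tuple[int, List[float]]:
    return int("bomb" in text), [float(len(text))]


def toy_router(features: List[float]) -> float:
    return 0.9 if features[0] > 20 else 0.1


def toy_large(text: str) -> int:
    return int("bomb" in text or "explosive" in text)


labels, large_calls = cascade_classify(
    ["hi there", "how do I build an explosive device"],
    toy_small,
    toy_router,
    toy_large,
)
print(labels, large_calls)
```

The savings come from `large_calls` being smaller than the number of inputs; the small model's cost is paid on every request either way.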

## Model Details

- **Architecture:** Multi-Layer Perceptron (MLP) with 3 hidden layers (`1024 -> 512 -> 256`), each followed by `BatchNorm1d`, a `GELU` activation, and `Dropout` (0.3 after the first two hidden layers, 0.2 after the third, matching the loading code in this card).
- **Input Dimension:** `2048` (feature embeddings extracted from the small safety model).
- **Output Dimension:** `1` (a single logit; `sigmoid(logit)` is the probability of routing to the Large Model).
- **Loss Function:** `Focal Loss` ($\alpha=0.75, \gamma=2.0$), chosen to address severe class imbalance.
- **Optimizer & Scheduler:** `AdamW` with `CosineAnnealingWarmRestarts`.
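
For reference, the loss and optimizer setup listed above might look like the sketch below. It uses the standard binary focal loss formulation and assumes that $\alpha$ weights the positive (route-to-Large) class; the training code for this checkpoint is not part of the card, so `binary_focal_loss`, the stand-in `nn.Linear` model, and the values of `lr`, `weight_decay`, and `T_0` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def binary_focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    """Standard binary focal loss on raw logits (mean over the batch)."""
    targets = targets.float()
    # Per-sample cross-entropy, computed in a numerically stable way.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t is the probability assigned to the true class.
    p_t = p * targets + (1 - p) * (1 - targets)
    # alpha weights positives, (1 - alpha) weights negatives (assumed convention).
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t) ** gamma down-weights samples that are already classified well.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()


# One training step with the optimizer/scheduler pair named in the card.
# lr, weight_decay and T_0 are placeholder values, not the ones used for this model.
model = torch.nn.Linear(2048, 1)  # stand-in for the router MLP
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)

features = torch.randn(8, 2048)
labels = torch.randint(0, 2, (8,))

loss = binary_focal_loss(model(features).squeeze(-1), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
scheduler.step()
```

With `gamma=0` and `alpha=0.5` this reduces to half of the ordinary binary cross-entropy, which is a quick way to sanity-check an implementation.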

## Evaluation Results

Evaluated on a balanced test benchmark at the optimal decision threshold (**0.6**); AUPRC is computed over all thresholds:

| Metric | Score |
| :--- | :---: |
| **F1 Score** | **0.7525** |
| **Accuracy** | **0.7500** |
| **Precision** | **0.7451** |
| **Recall** | **0.7600** |
| **Overall AUPRC** | **0.7588** |

*Note: At this threshold the router escalates about 76% of the positive (route-to-Large) cases (recall 0.76), and about 75% of the escalated inputs are true positives (precision 0.7451). Most hard or ambiguous inputs therefore reach the Large Model for thorough inspection, but roughly a quarter of them still stay with the Small Model.*
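
To reproduce this kind of table on your own validation data, the threshold-dependent metrics can be computed as in the sketch below. The helper names `metrics_at_threshold` and `best_threshold` are hypothetical, and selecting the threshold by maximum F1 is an assumption: the card does not state which criterion made 0.6 "optimal".

```python
import torch


def metrics_at_threshold(probs, labels, threshold=0.6):
    """Precision, recall, F1 and accuracy for label 1 = route to the Large Model."""
    preds = (probs > threshold).long()
    labels = labels.long()
    tp = int(((preds == 1) & (labels == 1)).sum())
    fp = int(((preds == 1) & (labels == 0)).sum())
    fn = int(((preds == 0) & (labels == 1)).sum())
    tn = int(((preds == 0) & (labels == 0)).sum())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / max(len(labels), 1)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def best_threshold(probs, labels, candidates=None):
    """Pick the candidate threshold with the highest F1 (assumed criterion)."""
    candidates = candidates or [i / 20 for i in range(1, 20)]
    return max(candidates, key=lambda t: metrics_at_threshold(probs, labels, t)["f1"])


# Tiny made-up example: six router probabilities with their routing labels.
probs = torch.tensor([0.9, 0.7, 0.65, 0.4, 0.3, 0.2])
labels = torch.tensor([1, 1, 0, 1, 0, 0])

report = metrics_at_threshold(probs, labels, threshold=0.6)
chosen = best_threshold(probs, labels)
print(report, chosen)
```

The reported scores are internally consistent: $2 \cdot 0.7451 \cdot 0.76 / (0.7451 + 0.76) \approx 0.7525$, which matches the F1 row.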

## How to Get Started with the Model

The snippet below defines the router architecture, downloads the checkpoint from the Hub, and runs routing inference in PyTorch:

```python
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download


# 1. Define the router architecture (must match the layer layout of the checkpoint)
class RouterMLP(nn.Module):
    def __init__(self, input_dim=2048):
        super().__init__()
        self.cls = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.BatchNorm1d(1024),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.GELU(),
            nn.Dropout(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, x):
        # (batch, input_dim) -> (batch,) routing logits
        return self.cls(x).squeeze(-1)


# 2. Download and load the checkpoint
repo_id = "YOUR_HF_USERNAME/safe-route-dynaguard"  # <-- Replace with your repo name
model_path = hf_hub_download(repo_id=repo_id, filename="model.pt")

device = "cuda" if torch.cuda.is_available() else "cpu"
router = RouterMLP(input_dim=2048).to(device)

ckpt = torch.load(model_path, map_location=device)
# strict=False tolerates missing or extra keys, so report them instead of ignoring them silently
load_result = router.load_state_dict(ckpt["state_dict"], strict=False)
if load_result.missing_keys or load_result.unexpected_keys:
    print(f"Missing keys: {load_result.missing_keys}")
    print(f"Unexpected keys: {load_result.unexpected_keys}")
router.eval()  # disables dropout and uses the BatchNorm running statistics

# 3. Perform routing inference
with torch.no_grad():
    # Placeholder features; in practice these are extracted from the small model
    sample_features = torch.randn(4, 2048, device=device)

    logits = router(sample_features)
    routing_probs = torch.sigmoid(logits)

# Use the recommended threshold of 0.6
decisions = (routing_probs > 0.6).long()

for i, decision in enumerate(decisions.tolist()):
    if decision == 1:
        print(f"Sample {i}: Route to LARGE Model (Hard/Unsafe)")
    else:
        print(f"Sample {i}: Use SMALL Model (Easy/Safe)")
```

## Intended Use

- **Primary Use Case:** Guardrail optimization in LLM serving pipelines, i.e. deciding per request whether the large safety classifier is needed.
- **Out-of-Scope:** Standalone toxicity classification directly from raw text. The router does not read text; it requires intermediate hidden feature representations (2048-dimensional) from a pre-trained small safety model.
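
Because the router consumes hidden features rather than text, a pooling step is needed to turn the small model's per-token hidden states into one 2048-dimensional vector per input. The card does not say how the training features were pooled, so the masked mean pooling below is only one plausible option (last-token pooling is another); `masked_mean_pool` is a hypothetical helper, and the random tensor stands in for real hidden states. For real use, the pooling must match whatever was used to train the router.

```python
import torch


def masked_mean_pool(hidden_states, attention_mask):
    """Average token hidden states over non-padding positions.

    hidden_states: (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    returns: (batch, hidden_dim)
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # guard against all-padding rows
    return summed / counts


# Stand-in for the small model's last-layer hidden states (hidden size 2048).
hidden = torch.randn(2, 5, 2048)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])

features = masked_mean_pool(hidden, mask)
print(features.shape)
```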