---
language:
- en
- vi
tags:
- safety
- guardrail
- routing
- pytorch
- tabular-classification
metrics:
- f1
- accuracy
- precision
- recall
---

# SafeRoute Router Model (DynaGuard 1.7B / 8B)

This repository contains the weights for the **SafeRoute Router**, a small neural router that decides, for each input prompt or response, whether a lightweight safety classifier (Small Model) is sufficient or a high-capacity safety classifier (Large Model) should be used.

By keeping "easy/safe" inputs on the small model and invoking the large model only for "hard/unsafe" ones, the system reduces inference latency and compute cost while preserving overall safety-evaluation performance.
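
The saving can be reasoned about with a simple expected-cost model. The sketch below assumes that the small model runs on every request (the router consumes its features), that the large model runs only on routed requests, and that the router's own cost is negligible. The function name and the cost figures are hypothetical, not measurements from this repository.

```python
def expected_cost(route_fraction: float, small_cost: float, large_cost: float) -> float:
    """Average per-request cost when only a fraction of traffic reaches the large model."""
    if not 0.0 <= route_fraction <= 1.0:
        raise ValueError("route_fraction must be in [0, 1]")
    # The small model always runs; the large model runs only for routed requests.
    return small_cost + route_fraction * large_cost


# Hypothetical relative costs: the large model costs 5 units, the small model 1 unit.
always_large = expected_cost(1.0, small_cost=0.0, large_cost=5.0)  # baseline without routing
with_router = expected_cost(0.3, small_cost=1.0, large_cost=5.0)   # 30% of traffic routed
print(f"always large: {always_large:.2f}, with router: {with_router:.2f}")
```

Under these made-up numbers routing halves the average cost. The actual benefit depends on the fraction of traffic the router escalates and on the real cost ratio of the two classifiers.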

## Model Details

- **Architecture:** Multi-Layer Perceptron (MLP) with 3 hidden layers (`1024 -> 512 -> 256`), each followed by `BatchNorm1d`, a `GELU` activation, and `Dropout` (0.3 after the first two hidden layers, 0.2 after the third, matching the reference code below).
- **Input Dimension:** `2048` (feature embeddings extracted from the small safety model).
- **Output Dimension:** `1` (a single binary-classification logit; applying a sigmoid gives the probability that the input should be routed to the Large Model).
- **Loss Function:** `Focal Loss` ($\alpha=0.75, \gamma=2.0$) tailored to address severe class imbalance.
- **Optimizer & Scheduler:** `AdamW` with `CosineAnnealingWarmRestarts`.
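
For readers unfamiliar with focal loss, the following is a minimal per-example sketch of the standard binary formulation, $-\alpha_t (1 - p_t)^{\gamma} \log p_t$, using the $\alpha$ and $\gamma$ listed above. It is an assumption that training used this exact form. The function name is hypothetical and the code is written in plain Python for readability, not taken from the training script.

```python
import math


def binary_focal_loss(logit: float, target: int, alpha: float = 0.75, gamma: float = 2.0) -> float:
    """Focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    # Sigmoid of the logit; not hardened against extreme logits (illustration only).
    p = 1.0 / (1.0 + math.exp(-logit))
    # p_t is the probability assigned to the true class.
    p_t = p if target == 1 else 1.0 - p
    # alpha weights the positive class, (1 - alpha) the negative class.
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # (1 - p_t)**gamma shrinks the loss of examples that are already classified well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = binary_focal_loss(logit=4.0, target=1)   # confident and correct
hard = binary_focal_loss(logit=-2.0, target=1)  # confident and wrong
print(f"easy positive: {easy:.6f}, hard positive: {hard:.6f}")
```

The modulating factor makes well-classified examples contribute very little, so the gradient is dominated by the hard and minority-class examples, which is the usual motivation for focal loss under class imbalance.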

## Evaluation Results

Evaluated on a balanced Test Benchmark at the optimal decision threshold (**0.6**):

| Metric | Score |
| :--- | :---: |
| **F1 Score** | **0.7525** |
| **Accuracy** | **0.7500** |
| **Precision** | **0.7451** |
| **Recall** | **0.7600** |
| **Overall AUPRC** | **0.7588** |

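*Note: Here the positive class is "route to the Large Model". At this threshold the router escalates about 76% of the inputs that need the Large Model (recall 0.76), and about 75% of the inputs it escalates actually need it (precision 0.75). Roughly one in four hard inputs therefore stays with the Small Model.*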
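
As a quick consistency check, the reported F1 score can be recomputed from the precision and recall in the table. The helper name below is hypothetical; the only inputs are the two table values.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_f1 = f1_score(precision=0.7451, recall=0.7600)
print(round(reported_f1, 4))  # 0.7525, matching the table
```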

## How to Get Started with the Model

Download the checkpoint from the Hugging Face Hub and load it in a PyTorch pipeline:

```python
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download

# 1. Define the Router Architecture
class RouterMLP(nn.Module):
    def __init__(self, input_dim=2048):
        super().__init__()
        self.cls = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.BatchNorm1d(1024),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.GELU(),
            nn.Dropout(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, x):
        return self.cls(x).squeeze(-1)

# 2. Download and Load the Checkpoint
repo_id = "YOUR_HF_USERNAME/safe-route-dynaguard" # <-- Replace with your repo name
model_path = hf_hub_download(repo_id=repo_id, filename="model.pt")

device = "cuda" if torch.cuda.is_available() else "cpu"
router = RouterMLP(input_dim=2048).to(device)

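ckpt = torch.load(model_path, map_location=device)
# strict=False silently skips mismatched keys, so report any mismatch explicitly
load_result = router.load_state_dict(ckpt["state_dict"], strict=False)
if load_result.missing_keys or load_result.unexpected_keys:
    print(f"Missing keys: {load_result.missing_keys}; unexpected keys: {load_result.unexpected_keys}")
router.eval()  # disables Dropout and makes BatchNorm use its running statistics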

# 3. Perform Routing Inference
with torch.no_grad():
    # Random placeholder; replace with the 2048-dim features extracted from the small safety model
    sample_features = torch.randn(4, 2048, device=device)
    
    logits = router(sample_features)
    routing_probs = torch.sigmoid(logits)
    
    # Use recommended threshold 0.6
    decisions = (routing_probs > 0.6).long()
    
    for i, decision in enumerate(decisions.tolist()):
        if decision == 1:
            print(f"Sample {i}: Route to LARGE Model (Hard/Unsafe)")
        else:
            print(f"Sample {i}: Use SMALL Model (Easy/Safe)")
```

## Intended Use

- **Primary Use Case:** Guardrail optimization in LLM serving pipelines.
- **Out-of-Scope:** Standalone toxicity classification directly from raw text (this model requires intermediate hidden feature representations from a pre-trained small safety model).
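
To make the intended placement in a serving pipeline concrete, here is a minimal sketch of the control flow. It assumes the small model returns both its feature vector and its own safety verdict in one call. `guard` and the three stand-in callables are hypothetical and only illustrate the dispatch logic; they are not part of this repository.

```python
from typing import Callable, Sequence, Tuple


def guard(
    text: str,
    small_model: Callable[[str], Tuple[Sequence[float], bool]],
    router: Callable[[Sequence[float]], float],
    large_model: Callable[[str], bool],
    threshold: float = 0.6,
) -> Tuple[bool, str]:
    """Return (is_unsafe, deciding_model) for one input."""
    # The small model always runs: it supplies the router's input features.
    features, small_verdict = small_model(text)
    # Escalate only when the routing probability exceeds the threshold.
    if router(features) > threshold:
        return large_model(text), "large"
    return small_verdict, "small"


# Stand-ins so the sketch runs without any model weights (purely illustrative).
def fake_small(text: str) -> Tuple[Sequence[float], bool]:
    return [float(len(text))], False

def fake_router(features: Sequence[float]) -> float:
    return 0.9 if features[0] > 10 else 0.1

def fake_large(text: str) -> bool:
    return True


print(guard("hi", fake_small, fake_router, fake_large))
print(guard("a much longer prompt", fake_small, fake_router, fake_large))
```

In a real deployment the stand-ins would be replaced by the small safety model, the `RouterMLP` followed by a sigmoid, and the large safety model.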