StevenMup2004


---
language:
- en
- vi
tags:
- safety
- guardrail
- routing
- pytorch
- tabular-classification
metrics:
- f1
- accuracy
- precision
- recall
---

# SafeRoute Router Model (DynaGuard 1.7B / 8B)

This repository contains the weights for the SafeRoute Router, a small neural router that decides, for each input prompt or response, whether it can be handled by a lightweight safety classifier (Small Model) or should be escalated to a high-capacity safety classifier (Large Model).

Because "easy/safe" inputs go to the Small Model and only "hard/unsafe" inputs reach the Large Model, the Large Model runs on a subset of the traffic. This lowers average inference latency and compute while aiming to preserve overall safety-classification performance.

## Model Details

- Architecture: Multi-Layer Perceptron (MLP) with 3 hidden layers (`1024 -> 512 -> 256`). Each hidden layer is followed by `BatchNorm1d`, a `GELU` activation, and `Dropout` (0.3, 0.3, and 0.2 respectively).
- Input Dimension: `2048` (a feature embedding extracted from the Small Model).
- Output Dimension: `1` (a single logit; its sigmoid is the probability that the input should be routed to the Large Model).
- Loss Function: `Focal Loss` ($\alpha=0.75, \gamma=2.0$) to counter severe class imbalance in the routing labels.
- Optimizer & Scheduler: `AdamW` with `CosineAnnealingWarmRestarts`.
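The card names the loss and optimizer but does not include training code. The sketch below assumes the standard alpha-balanced binary focal loss, $\mathrm{FL}(p_t) = -\alpha_t (1-p_t)^{\gamma}\log p_t$, with $\alpha$ weighting the positive ("route to Large") class and the card's $\alpha=0.75$, $\gamma=2.0$. Several elements are hypothetical placeholders, not the authors' settings:

- the learning rate, weight decay, `T_0`, and `T_mult`;
- the random data;
- the linear stand-in for `RouterMLP`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


def binary_focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    """Alpha-balanced binary focal loss, averaged over the batch.

    logits: (N,) raw router outputs; targets: (N,) with 1 = route to Large Model.
    """
    targets = targets.float()
    # Per-example BCE equals -log(p_t), where p_t is the probability of the true class.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    # alpha weights the positive class, (1 - alpha) the negative class.
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma down-weights examples that are already classified confidently.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()


def train_epoch(model, loader, optimizer, scheduler):
    """One epoch of focal-loss training; returns the mean batch loss."""
    model.train()
    total = 0.0
    for x, y in loader:
        optimizer.zero_grad()
        loss = binary_focal_loss(model(x).squeeze(-1), y)
        loss.backward()
        optimizer.step()
        total += loss.item()
    scheduler.step()  # one scheduler step per epoch (hypothetical schedule)
    return total / len(loader)


# Hypothetical setup: random, imbalanced data and a linear stand-in for RouterMLP.
torch.manual_seed(0)
features = torch.randn(64, 2048)
labels = (torch.rand(64) < 0.2).float()  # roughly 20% positives
loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True)
model = nn.Linear(2048, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

for epoch in range(3):
    mean_loss = train_epoch(model, loader, optimizer, scheduler)
    print(f"epoch {epoch}: focal loss {mean_loss:.4f}")
```

With $\gamma=0$ and $\alpha=0.5$, this loss reduces to half the ordinary binary cross-entropy. This is a quick way to sanity-check the implementation.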

## Evaluation Results

Evaluated on a balanced test benchmark at the selected decision threshold of 0.6:

| Metric | Score |
| :--- | :---: |
| F1 Score | 0.7525 |
| Accuracy | 0.7500 |
| Precision | 0.7451 |
| Recall | 0.7600 |
| Overall AUPRC (threshold-independent) | 0.7588 |
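The threshold-dependent metrics in the table can be computed from router probabilities as in the minimal pure-Python sketch below. It assumes label 1 means "route to the Large Model", matching the inference snippet. The helper names are illustrative. AUPRC is not covered, because it summarizes the whole precision-recall curve rather than a single threshold. The last line checks that the reported F1 is the harmonic mean of the reported precision and recall.

```python
def binary_metrics(probs, labels, threshold=0.6):
    """Precision, recall, F1 and accuracy for binary routing decisions.

    probs:  router probabilities (sigmoid of the logits), one per input.
    labels: ground-truth routing labels, 1 = should go to the Large Model.
    """
    preds = [1 if p > threshold else 0 for p in probs]  # same rule as the card: p > 0.6
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(labels)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def f1_from_pr(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Consistency check on the table: P = 0.7451 and R = 0.7600 give F1 of about 0.7525.
print(round(f1_from_pr(0.7451, 0.7600), 4))  # 0.7525
```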

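Note: At this threshold, the router escalates 76% of the inputs that need the Large Model (recall 0.76), and about 75% of its escalations are correct (precision 0.7451). The remaining 24% of hard or unsafe inputs are left to the Small Model. Lowering the threshold trades extra Large Model calls for higher recall.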
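The precision/recall balance above is tied to the 0.6 threshold. The card does not say how 0.6 was chosen. The sketch below assumes a simple sweep that keeps the candidate threshold with the highest F1, using the identity F1 = 2TP / (2TP + FP + FN). `best_f1_threshold` and its candidate grid are hypothetical. Such a sweep is best run on a validation split rather than on the test benchmark itself.

```python
def best_f1_threshold(probs, labels, candidates=None):
    """Return (threshold, f1) maximizing F1 over candidate thresholds.

    Uses the same decision rule as the card: route to Large if p > t.
    """
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best_t, best_f1 = None, -1.0
    for t in candidates:
        tp = fp = fn = 0
        for p, y in zip(probs, labels):
            pred = p > t
            if pred and y == 1:
                tp += 1
            elif pred and y == 0:
                fp += 1
            elif not pred and y == 1:
                fn += 1
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Strict ">" with ascending candidates keeps the lowest threshold on ties.
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


t, f1 = best_f1_threshold([0.9, 0.8, 0.55, 0.3, 0.1], [1, 1, 0, 0, 0])
print(t, f1)  # 0.55 1.0
```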

## How to Get Started with the Model

The snippet below downloads the checkpoint and runs routing inference in PyTorch:

```python
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download

# 1. Define the Router Architecture (must match the checkpoint's layer layout)
class RouterMLP(nn.Module):
    def __init__(self, input_dim=2048):
        super().__init__()
        self.cls = nn.Sequential(
            nn.Linear(input_dim, 1024),
            nn.BatchNorm1d(1024),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(1024, 512),
            nn.BatchNorm1d(512),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.GELU(),
            nn.Dropout(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, x):
        # (batch, input_dim) -> (batch,) routing logits
        return self.cls(x).squeeze(-1)

# 2. Download and Load the Checkpoint
repo_id = "YOUR_HF_USERNAME/safe-route-dynaguard"  # <-- Replace with your repo name
model_path = hf_hub_download(repo_id=repo_id, filename="model.pt")

device = "cuda" if torch.cuda.is_available() else "cpu"
router = RouterMLP(input_dim=2048).to(device)

ckpt = torch.load(model_path, map_location=device)
# Weights are stored under "state_dict"; fall back to a bare state_dict file.
state_dict = ckpt.get("state_dict", ckpt)
# Strict loading (the default) fails loudly on missing or unexpected keys
# instead of silently leaving layers randomly initialized.
router.load_state_dict(state_dict)
router.eval()  # BatchNorm uses running statistics, Dropout is disabled

# 3. Perform Routing Inference
with torch.no_grad():
    # Placeholder features; in practice, extract them from the Small Model
    sample_features = torch.randn(4, 2048, device=device)

    logits = router(sample_features)
    routing_probs = torch.sigmoid(logits)

    # Use the recommended threshold of 0.6
    decisions = (routing_probs > 0.6).long()

for i, decision in enumerate(decisions.tolist()):
    if decision == 1:
        print(f"Sample {i}: Route to LARGE Model (Hard/Unsafe)")
    else:
        print(f"Sample {i}: Use SMALL Model (Easy/Safe)")
```

## Intended Use

- Primary Use Case: Guardrail optimization in LLM serving pipelines.
- Out-of-Scope: Standalone toxicity classification directly from raw text. The router does not read text; it requires intermediate hidden-feature representations from the pre-trained Small Model.
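The card does not specify which hidden layer of the Small Model supplies the 2048-d features, or how tokens are pooled. The sketch below is an assumption only: it shows masked mean pooling over one layer's hidden states. For a `transformers` model, these hidden states can be requested with `output_hidden_states=True`. The router may instead have been trained on a different layer or pooling scheme, such as the last token's state. In that case, features produced this way will not match its training distribution. `masked_mean_pool` is a hypothetical helper.

```python
import torch


def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token states, ignoring padding (assumed pooling scheme).

    hidden_states:  (batch, seq_len, hidden) from one layer of the Small Model.
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding.
    Returns:        (batch, hidden) feature vectors, e.g. (batch, 2048) for the router.
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero on all-padding rows
    return summed / counts


# Toy check: the padded third token is ignored.
h = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
m = torch.tensor([[1, 1, 0]])
print(masked_mean_pool(h, m))  # tensor([[2., 3.]])
```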