How to use taran1812/agentguard-decision-distilbert with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="taran1812/agentguard-decision-distilbert")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("taran1812/agentguard-decision-distilbert")
model = AutoModelForSequenceClassification.from_pretrained("taran1812/agentguard-decision-distilbert")
```
AgentGuard Decision DistilBERT
This model is the four-way decision classifier used by AgentGuard, a policy-aware AI agent guardrail for evaluating tool execution requests.
It predicts one of:
- `ALLOW`
- `BLOCK`
- `ESCALATE_TO_HUMAN`
- `REWRITE_REQUIRED`
The model consumes a formatted input containing:
- User request
- Proposed tool
- Tool arguments
- Policy context
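The four fields above can be assembled into a single string before classification. The helper below is a minimal sketch that follows the layout shown under Example Input in this card. The function name `format_agentguard_input` is hypothetical, and the exact whitespace used during training is not documented here, so treat the layout as an assumption.

```python
import json


def format_agentguard_input(user_request, proposed_tool, tool_arguments, policy_context):
    """Build the four-section text block shown in this card's Example Input.

    Assumption: each section header sits on its own line, followed by its
    value on the next line. The training-time format may differ in detail.
    """
    # Serialize the tool arguments as JSON, matching the example in the card.
    args_json = json.dumps(tool_arguments)
    sections = [
        ("USER_REQUEST", user_request),
        ("PROPOSED_TOOL", proposed_tool),
        ("TOOL_ARGUMENTS", args_json),
        ("POLICY_CONTEXT", policy_context),
    ]
    return "\n".join(f"{name}:\n{value}" for name, value in sections)


text = format_agentguard_input(
    "Refund $8,500 to customer ACCT-9012 due to a billing dispute.",
    "issue_refund",
    {"amount": 8500, "customer_id": "ACCT-9012"},
    "Refunds above $1,000 require manager approval.",
)
print(text)
```

Keeping the formatting in one function makes it easier to guarantee that every request reaches the classifier in the same shape.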
Related Demo
- Live Space: https://huggingface.co/spaces/taran1812/agentguard-demo
- Companion risk model: https://huggingface.co/taran1812/agentguard-risk-distilbert
Intended Use
This model is intended for research, portfolio demonstration, and prototyping of policy-aware guardrails for LLM/AI agent tool execution.
Example use cases:
- Decide whether a proposed agent action should execute
- Block clear policy violations
- Escalate high-impact actions for human review
- Mark vague or insufficiently grounded requests as rewrite-required
This model should not be used as a sole production authorization or compliance mechanism.
Label Semantics
| Label | Meaning |
|---|---|
| `ALLOW` | Safe and policy-compliant |
| `BLOCK` | Unsafe, prohibited, or malicious |
| `ESCALATE_TO_HUMAN` | High-impact, approval-gated, or human-review required |
| `REWRITE_REQUIRED` | Request may be legitimate but is too vague or not safely executable |
Evaluation
The model was evaluated on a held-out test split from a 2,550-example synthetic enterprise policy/tool dataset.
| Metric | Score |
|---|---|
| Accuracy | 0.8300 |
| Macro F1 | 0.8243 |
| Weighted F1 | 0.8269 |
Per-Class F1
| Label | F1 |
|---|---|
| `ALLOW` | 0.99 |
| `BLOCK` | 0.78 |
| `ESCALATE_TO_HUMAN` | 0.86 |
| `REWRITE_REQUIRED` | 0.66 |
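As a quick consistency check, macro F1 is the unweighted mean of the per-class F1 scores. The sketch below recomputes it from the rounded per-class values reported here. Because those values are rounded to two decimals, the result (about 0.8225) can differ slightly from the reported 0.8243.

```python
# Per-class F1 scores as reported in this card (rounded to two decimals).
per_class_f1 = {
    "ALLOW": 0.99,
    "BLOCK": 0.78,
    "ESCALATE_TO_HUMAN": 0.86,
    "REWRITE_REQUIRED": 0.66,
}

# Macro F1 weights every class equally, regardless of class frequency.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Macro F1 value reported in the Evaluation table.
reported_macro_f1 = 0.8243

# Each input is rounded to two decimals, so the mean can be off by up to 0.005.
gap = abs(macro_f1 - reported_macro_f1)
print(f"recomputed macro F1: {macro_f1:.4f}, gap to reported: {gap:.4f}")
```

The weighted F1 cannot be recomputed the same way, since the per-class support counts for the test split are not listed in this card.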
Baseline Comparison
| Model | Accuracy | Macro F1 | Weighted F1 |
|---|---|---|---|
| TF-IDF + Logistic Regression | 0.8000 | 0.7960 | 0.7987 |
| AgentGuard Decision DistilBERT | 0.8300 | 0.8243 | 0.8269 |
Training Data
The training dataset was a custom synthetic policy/tool execution corpus built specifically for AgentGuard.
Dataset characteristics
- 2,550 examples
- Split:
  - 1,784 train
  - 383 validation
  - 383 test
- Enterprise domains:
  - Finance
  - HR
  - IT
  - Support
  - Legal
  - Sales
  - Healthcare
  - Banking
  - Security
  - Procurement
  - Marketing
Data creation process
- Hand-curated seed examples
- LLM-assisted synthetic expansion
- Schema validation and de-duplication
- Targeted repair for underrepresented risk scenarios
- Final split regeneration and retraining
Model Details
| Attribute | Value |
|---|---|
| Base model | distilbert-base-uncased |
| Architecture | Sequence classification |
| Number of labels | 4 |
| Language | English |
| Library | Transformers |
Example Input
```text
USER_REQUEST:
Refund $8,500 to customer ACCT-9012 due to a billing dispute.
PROPOSED_TOOL:
issue_refund
TOOL_ARGUMENTS:
{"amount": 8500, "customer_id": "ACCT-9012"}
POLICY_CONTEXT:
Refunds above $1,000 require manager approval.
```
Example Output
```text
ESCALATE_TO_HUMAN
```
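To run the example above end to end, one option is the Transformers `pipeline` helper. This is a minimal sketch, not a reference implementation. The helper names `load_decision_pipeline` and `classify_request` are hypothetical, and it assumes the model's configuration maps class indices to the four label names (otherwise the pipeline would return generic names such as `LABEL_0`).

```python
def load_decision_pipeline(model_id="taran1812/agentguard-decision-distilbert"):
    """Create a Transformers text-classification pipeline for this model."""
    # Imported lazily so the rest of this sketch works without the library.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def classify_request(classifier, text):
    """Return (label, score) for the top prediction on one formatted input."""
    # For a single string, the pipeline returns a list holding one dict
    # with "label" and "score" keys.
    result = classifier(text)[0]
    return result["label"], result["score"]


# The Example Input from this card, as a single string.
EXAMPLE_INPUT = (
    "USER_REQUEST:\n"
    "Refund $8,500 to customer ACCT-9012 due to a billing dispute.\n"
    "PROPOSED_TOOL:\n"
    "issue_refund\n"
    "TOOL_ARGUMENTS:\n"
    '{"amount": 8500, "customer_id": "ACCT-9012"}\n'
    "POLICY_CONTEXT:\n"
    "Refunds above $1,000 require manager approval."
)

# Usage (downloads the model weights on first run):
# classifier = load_decision_pipeline()
# label, score = classify_request(classifier, EXAMPLE_INPUT)
# print(label, round(score, 3))
```

Passing the classifier in as an argument keeps `classify_request` easy to test with a stub and lets the same pipeline object be reused across requests.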
Limitations
- Trained on synthetic policy/tool data, not on audited production data.
- Should not be used as a substitute for hard authorization, IAM, or compliance checks.
- `REWRITE_REQUIRED` is the most semantically ambiguous class and has the lowest class-level F1.
- Performance outside the demonstrated policy/tool format has not been established.
Recommended Deployment Pattern
Use this model together with the companion AgentGuard risk classifier and a deterministic decision layer that applies:
- confidence thresholds,
- human escalation rules,
- high-risk overrides,
- and downstream authorization enforcement.
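One way such a deterministic layer could look is sketched below. Everything in it is an illustrative assumption rather than AgentGuard's actual logic: the function name `final_decision`, the 0.80 confidence threshold, the boolean `risk_is_high` signal (the companion risk model's label set is not described in this card), and the `is_authorized` flag standing in for a real authorization check.

```python
VALID_LABELS = {"ALLOW", "BLOCK", "ESCALATE_TO_HUMAN", "REWRITE_REQUIRED"}


def final_decision(label, score, risk_is_high, is_authorized, min_confidence=0.80):
    """Combine the classifier output with deterministic rules (hypothetical).

    label / score  -- top prediction from the decision model
    risk_is_high   -- assumed boolean derived from the companion risk model
    is_authorized  -- result of a separate, hard authorization check
    """
    if label not in VALID_LABELS:
        raise ValueError(f"unexpected label: {label!r}")

    # A predicted block is never downgraded by later rules.
    if label == "BLOCK":
        return "BLOCK"

    # Downstream authorization enforcement: the model cannot grant access.
    if not is_authorized:
        return "BLOCK"

    # Confidence threshold: uncertain predictions go to a human.
    if score < min_confidence:
        return "ESCALATE_TO_HUMAN"

    # High-risk override: do not auto-execute actions flagged as high risk.
    if risk_is_high and label == "ALLOW":
        return "ESCALATE_TO_HUMAN"

    return label


print(final_decision("ALLOW", 0.95, risk_is_high=True, is_authorized=True))
```

The rule order matters in this sketch: hard blocks and authorization failures are checked first, so no confidence score or risk signal can turn them into an executed action.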
Citation / Attribution
This model was built as part of the AgentGuard project by Taran Patel.