AgentGuard Decision DistilBERT

This model is the four-way decision classifier used by AgentGuard, a policy-aware AI agent guardrail for evaluating tool execution requests.

It predicts one of:

ALLOW
BLOCK
ESCALATE_TO_HUMAN
REWRITE_REQUIRED

The model consumes a formatted input containing:

User request
Proposed tool
Tool arguments
Policy context
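A minimal sketch of how the four fields could be assembled into a single input string. It assumes the section headers and blank-line separators shown under Example Input in this card; the helper name `format_input` and the use of `json.dumps` for the arguments are illustrative assumptions, since the exact serialization used during training is not documented here.

```python
import json


def format_input(user_request, proposed_tool, tool_arguments, policy_context):
    """Render the four fields in the sectioned layout shown under Example Input.

    Hypothetical helper: the header names come from this card, but the exact
    whitespace and JSON serialization used in training are assumptions.
    """
    sections = [
        ("USER_REQUEST", user_request.strip()),
        ("PROPOSED_TOOL", proposed_tool.strip()),
        ("TOOL_ARGUMENTS", json.dumps(tool_arguments)),
        ("POLICY_CONTEXT", policy_context.strip()),
    ]
    # One "HEADER:\nvalue" block per field, separated by a blank line.
    return "\n\n".join(f"{name}:\n{value}" for name, value in sections)


text = format_input(
    "Refund $8,500 to customer ACCT-9012 due to a billing dispute.",
    "issue_refund",
    {"amount": 8500, "customer_id": "ACCT-9012"},
    "Refunds above $1,000 require manager approval.",
)
print(text)
```

Keeping the formatting in one function makes it easier to guarantee that inference-time inputs match whatever layout the classifier was trained on.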

Related Demo

Live Space: https://huggingface.co/spaces/taran1812/agentguard-demo
Companion risk model: https://huggingface.co/taran1812/agentguard-risk-distilbert

Intended Use

This model is intended for research, portfolio demonstration, and prototyping of policy-aware guardrails for LLM/AI agent tool execution.

Example use cases:

Decide whether a proposed agent action should execute
Block clear policy violations
Escalate high-impact actions for human review
Mark vague or insufficiently grounded requests as rewrite-required

This model should not be used as a sole production authorization or compliance mechanism.

Label Semantics

Label	Meaning
`ALLOW`	Safe and policy-compliant
`BLOCK`	Unsafe, prohibited, or malicious
`ESCALATE_TO_HUMAN`	High-impact, approval-gated, or human-review required
`REWRITE_REQUIRED`	Request may be legitimate but is too vague or not safely executable

Evaluation

The model was evaluated on the held-out test split (383 examples) of a 2,550-example synthetic enterprise policy/tool dataset.

Metric	Score
Accuracy	0.8300
Macro F1	0.8243
Weighted F1	0.8269

Per-Class F1

Label	F1
`ALLOW`	0.99
`BLOCK`	0.78
`ESCALATE_TO_HUMAN`	0.86
`REWRITE_REQUIRED`	0.66
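As a quick consistency check, macro F1 is the unweighted mean of the per-class F1 scores. The sketch below recomputes it from the two-decimal values in the table above; it yields about 0.8225, and the small gap to the reported 0.8243 is consistent with the per-class scores having been rounded. Weighted F1 cannot be reproduced the same way because the per-class support counts are not listed in this card.

```python
# Per-class F1 values copied from the table above (rounded to two decimals).
per_class_f1 = {
    "ALLOW": 0.99,
    "BLOCK": 0.78,
    "ESCALATE_TO_HUMAN": 0.86,
    "REWRITE_REQUIRED": 0.66,
}

# Macro F1: every class counts equally, regardless of how many examples it has.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"macro F1 from rounded per-class scores: {macro_f1:.4f}")
```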

Baseline Comparison

Model	Accuracy	Macro F1	Weighted F1
TF-IDF + Logistic Regression	0.8000	0.7960	0.7987
AgentGuard Decision DistilBERT	0.8300	0.8243	0.8269

Training Data

The training dataset was a custom synthetic policy/tool execution corpus built specifically for AgentGuard.

Dataset characteristics

2,550 examples
Split:
- 1,784 train
- 383 validation
- 383 test
Enterprise domains:
- Finance
- HR
- IT
- Support
- Legal
- Sales
- Healthcare
- Banking
- Security
- Procurement
- Marketing
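The split sizes above can be checked with a few lines of arithmetic. The sketch below only uses the counts stated in this card and shows that they sum to 2,550 and correspond to roughly a 70/15/15 split; how the split was actually drawn (for example, whether it was stratified by label) is not stated here.

```python
# Split sizes as listed in this card.
split = {"train": 1784, "validation": 383, "test": 383}

total = sum(split.values())
shares = {name: count / total for name, count in split.items()}

for name, share in shares.items():
    print(f"{name}: {split[name]} examples ({share:.1%})")
print(f"total: {total}")
```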

Data creation process

Hand-curated seed examples
LLM-assisted synthetic expansion
Schema validation and de-duplication
Targeted repair for underrepresented risk scenarios
Final split regeneration and retraining
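The schema validation and de-duplication step might look roughly like the sketch below. This is an illustration, not the project's actual pipeline: the record field names in `REQUIRED_FIELDS`, the normalization rule, and the choice of fields in the duplicate key are all assumptions made for the example.

```python
import hashlib

LABELS = {"ALLOW", "BLOCK", "ESCALATE_TO_HUMAN", "REWRITE_REQUIRED"}

# Hypothetical record schema; the real dataset's field names are not documented.
REQUIRED_FIELDS = (
    "user_request", "proposed_tool", "tool_arguments", "policy_context", "label",
)


def is_valid(record):
    """Return True if the record has every field, a known label, and dict arguments."""
    if any(field not in record for field in REQUIRED_FIELDS):
        return False
    if record["label"] not in LABELS:
        return False
    if not isinstance(record["tool_arguments"], dict):
        return False
    return bool(str(record["user_request"]).strip())


def dedup_key(record):
    """Hash the case- and whitespace-normalized request plus tool and policy."""
    normalized = " ".join(str(record["user_request"]).lower().split())
    raw = f'{normalized}|{record["proposed_tool"]}|{record["policy_context"]}'
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def clean(records):
    """Drop invalid records, then keep the first occurrence of each duplicate key."""
    seen, kept = set(), []
    for record in records:
        if not is_valid(record):
            continue
        key = dedup_key(record)
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept
```

Running validation before de-duplication means a malformed record can never shadow a valid one that happens to share its key.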

Model Details

Attribute	Value
Base model	`distilbert-base-uncased`
Architecture	Sequence classification
Number of labels	4
Language	English
Library	Transformers

Example Input

USER_REQUEST:
Refund $8,500 to customer ACCT-9012 due to a billing dispute.

PROPOSED_TOOL:
issue_refund

TOOL_ARGUMENTS:
{"amount": 8500, "customer_id": "ACCT-9012"}

POLICY_CONTEXT:
Refunds above $1,000 require manager approval.

Example Output

ESCALATE_TO_HUMAN
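A minimal inference sketch using the Transformers `text-classification` pipeline on the example above. It assumes the checkpoint's config maps class ids to the four label names listed in this card (otherwise the pipeline would return generic names such as `LABEL_0`); the helper names `load_classifier` and `predict_decision` are illustrative.

```python
MODEL_ID = "taran1812/agentguard-decision-distilbert"

EXAMPLE_TEXT = (
    "USER_REQUEST:\n"
    "Refund $8,500 to customer ACCT-9012 due to a billing dispute.\n\n"
    "PROPOSED_TOOL:\n"
    "issue_refund\n\n"
    "TOOL_ARGUMENTS:\n"
    '{"amount": 8500, "customer_id": "ACCT-9012"}\n\n'
    "POLICY_CONTEXT:\n"
    "Refunds above $1,000 require manager approval."
)


def load_classifier(model_id=MODEL_ID):
    """Build a text-classification pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily so the module loads without it

    return pipeline("text-classification", model=model_id)


def predict_decision(text, classifier):
    """Return (label, score) for the top prediction of a pipeline-like callable."""
    top = classifier(text, truncation=True)[0]
    return top["label"], float(top["score"])


if __name__ == "__main__":
    label, score = predict_decision(EXAMPLE_TEXT, load_classifier())
    print(label, round(score, 3))
```

Passing the classifier in as an argument keeps `predict_decision` easy to test with a stub and avoids reloading the model on every call.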

Limitations

Trained on synthetic policy/tool data rather than audited production data.
Should not be used as a substitute for hard authorization, IAM, or compliance checks.
`REWRITE_REQUIRED` is the most semantically ambiguous class and has the lowest per-class F1 (0.66).
Performance outside the demonstrated policy/tool format has not been established.

Recommended Deployment Pattern

Use this model together with the companion AgentGuard risk classifier and a deterministic decision layer that applies:

confidence thresholds
human escalation rules
high-risk overrides
downstream authorization enforcement
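One way such a deterministic layer could be arranged is sketched below. Everything specific in it is an assumption for illustration: the `MIN_CONFIDENCE` value, the risk level names in `HIGH_RISK_LEVELS` (the companion risk model's label set is not described in this card), and the `caller_authorized` flag standing in for a real authorization system.

```python
DECISIONS = ("ALLOW", "BLOCK", "ESCALATE_TO_HUMAN", "REWRITE_REQUIRED")

MIN_CONFIDENCE = 0.80                    # hypothetical threshold, tune on validation data
HIGH_RISK_LEVELS = {"HIGH", "CRITICAL"}  # hypothetical labels from the risk classifier


def final_decision(decision, confidence, risk_level, caller_authorized):
    """Combine the model's decision with deterministic rules.

    decision:          label predicted by the decision classifier
    confidence:        score of that label, in [0, 1]
    risk_level:        output of the companion risk classifier (assumed label set)
    caller_authorized: result of a hard authorization check (e.g. IAM), assumed
                       to be computed elsewhere
    """
    if decision not in DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")

    # Downstream authorization enforcement: a failed hard check always blocks.
    if not caller_authorized:
        return "BLOCK"

    # A predicted BLOCK is never relaxed by the rules below.
    if decision == "BLOCK":
        return "BLOCK"

    # High-risk override: never auto-allow an action the risk model flags.
    if decision == "ALLOW" and risk_level in HIGH_RISK_LEVELS:
        return "ESCALATE_TO_HUMAN"

    # Confidence threshold: uncertain predictions go to a human.
    if confidence < MIN_CONFIDENCE:
        return "ESCALATE_TO_HUMAN"

    return decision
```

The rules are ordered from most to least restrictive, so the model's output can only be made stricter by this layer, never more permissive.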

Citation / Attribution

This model was built as part of the AgentGuard project by Taran Patel.

Model size: 67M params (Safetensors, tensor type F32)
