AgentGuard Decision DistilBERT

This model is the four-way decision classifier used by AgentGuard, a policy-aware AI agent guardrail for evaluating tool execution requests.

It predicts one of:

  • ALLOW
  • BLOCK
  • ESCALATE_TO_HUMAN
  • REWRITE_REQUIRED

The model consumes a formatted input containing:

  1. User request
  2. Proposed tool
  3. Tool arguments
  4. Policy context
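
A minimal sketch of how the four fields might be rendered into a single input string, following the sectioned layout of the example input in this card. The function name `format_guard_input` is hypothetical, and the exact whitespace used during training is not stated here, so treat the blank-line separators as an assumption.

```python
import json


def format_guard_input(user_request, proposed_tool, tool_arguments, policy_context):
    """Render the four fields as 'NAME:' headers followed by their values."""
    # Assumption: tool arguments are serialized as JSON, as in the card's example.
    args_json = json.dumps(tool_arguments)
    sections = [
        ("USER_REQUEST", user_request.strip()),
        ("PROPOSED_TOOL", proposed_tool.strip()),
        ("TOOL_ARGUMENTS", args_json),
        ("POLICY_CONTEXT", policy_context.strip()),
    ]
    # Assumption: sections are separated by one blank line.
    return "\n\n".join(f"{name}:\n{value}" for name, value in sections)


text = format_guard_input(
    user_request="Refund $8,500 to customer ACCT-9012 due to a billing dispute.",
    proposed_tool="issue_refund",
    tool_arguments={"amount": 8500, "customer_id": "ACCT-9012"},
    policy_context="Refunds above $1,000 require manager approval.",
)
print(text)
```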

Intended Use

This model is intended for research, portfolio demonstration, and prototyping of policy-aware guardrails for LLM/AI agent tool execution.

Example use cases:

  • Decide whether a proposed agent action should execute
  • Block clear policy violations
  • Escalate high-impact actions for human review
  • Mark vague or insufficiently grounded requests as rewrite-required

This model should not be used as a sole production authorization or compliance mechanism.


Label Semantics

Label              Meaning
ALLOW              Safe and policy-compliant
BLOCK              Unsafe, prohibited, or malicious
ESCALATE_TO_HUMAN  High-impact, approval-gated, or human-review required
REWRITE_REQUIRED   Request may be legitimate but is too vague or not safely executable

Evaluation

The model was evaluated on the 383-example held-out test split of a 2,550-example synthetic enterprise policy/tool dataset.

Metric       Score
Accuracy     0.8300
Macro F1     0.8243
Weighted F1  0.8269

Per-Class F1

Label              F1
ALLOW              0.99
BLOCK              0.78
ESCALATE_TO_HUMAN  0.86
REWRITE_REQUIRED   0.66
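
As a quick consistency check, macro F1 is the unweighted mean of the per-class F1 scores. The sketch below recomputes it from the rounded per-class values listed above. The result (about 0.8225) differs slightly from the reported 0.8243, which is consistent with the per-class scores being rounded to two decimals. Weighted F1 cannot be recomputed this way because per-class test counts are not given here.

```python
# Per-class F1 scores as reported in this card (rounded to two decimals).
per_class_f1 = {
    "ALLOW": 0.99,
    "BLOCK": 0.78,
    "ESCALATE_TO_HUMAN": 0.86,
    "REWRITE_REQUIRED": 0.66,
}

# Macro F1: every class counts equally, regardless of how many examples it has.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

reported_macro_f1 = 0.8243
gap = abs(macro_f1 - reported_macro_f1)
print(f"recomputed macro F1: {macro_f1:.4f}, gap to reported: {gap:.4f}")
```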

Baseline Comparison

Model                           Accuracy  Macro F1  Weighted F1
TF-IDF + Logistic Regression    0.8000    0.7960    0.7987
AgentGuard Decision DistilBERT  0.8300    0.8243    0.8269

Training Data

The training data is a custom synthetic policy/tool execution corpus built specifically for AgentGuard.

Dataset characteristics

  • 2,550 examples
  • Split:
    • 1,784 train
    • 383 validation
    • 383 test
  • Enterprise domains:
    • Finance
    • HR
    • IT
    • Support
    • Legal
    • Sales
    • Healthcare
    • Banking
    • Security
    • Procurement
    • Marketing

Data creation process

  1. Hand-curated seed examples
  2. LLM-assisted synthetic expansion
  3. Schema validation and de-duplication
  4. Targeted repair for underrepresented risk scenarios
  5. Final split regeneration and retraining
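
Step 3 can be illustrated with a small sketch. The record field names (`user_request`, `proposed_tool`, `tool_arguments`, `policy_context`, `label`) and the normalization rules are assumptions made for illustration. The actual AgentGuard schema and de-duplication logic are not described in this card.

```python
import json

ALLOWED_LABELS = {"ALLOW", "BLOCK", "ESCALATE_TO_HUMAN", "REWRITE_REQUIRED"}
TEXT_FIELDS = ("user_request", "proposed_tool", "policy_context")  # hypothetical names


def is_valid(record):
    """Check required fields, field types, and that the label is one of the four classes."""
    if record.get("label") not in ALLOWED_LABELS:
        return False
    if not isinstance(record.get("tool_arguments"), dict):
        return False
    return all(
        isinstance(record.get(field), str) and record[field].strip()
        for field in TEXT_FIELDS
    )


def dedup_key(record):
    """Build a key that ignores case and whitespace differences in the text fields."""
    normalized = tuple(" ".join(record[f].lower().split()) for f in TEXT_FIELDS)
    # sort_keys makes the key independent of argument ordering.
    return normalized + (json.dumps(record["tool_arguments"], sort_keys=True),)


def validate_and_deduplicate(records):
    """Drop invalid records, then keep the first occurrence of each distinct example."""
    seen, kept = set(), []
    for record in records:
        if not is_valid(record):
            continue
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


base = {
    "user_request": "Reset the VPN password for jdoe.",
    "proposed_tool": "reset_password",
    "tool_arguments": {"user": "jdoe", "system": "vpn"},
    "policy_context": "Password resets require a verified ticket.",
    "label": "ESCALATE_TO_HUMAN",
}
near_duplicate = dict(base, user_request="  reset the VPN   password for jdoe. ")
bad_label = dict(base, user_request="Delete all audit logs.", label="MAYBE")

cleaned = validate_and_deduplicate([base, near_duplicate, bad_label])
print(len(cleaned))
```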

Model Details

Attribute         Value
Base model        distilbert-base-uncased
Architecture      Sequence classification
Number of labels  4
Language          English
Library           Transformers

Example Input

USER_REQUEST:
Refund $8,500 to customer ACCT-9012 due to a billing dispute.

PROPOSED_TOOL:
issue_refund

TOOL_ARGUMENTS:
{"amount": 8500, "customer_id": "ACCT-9012"}

POLICY_CONTEXT:
Refunds above $1,000 require manager approval.

Example Output

ESCALATE_TO_HUMAN
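
A minimal inference sketch using the Transformers library, assuming the checkpoint is published under the Hub id `taran1812/agentguard-decision-distilbert` and that its config maps class indices to the four label names through `id2label`. If the config only carries generic names such as `LABEL_0`, a manual mapping would be needed. The helper name `classify` is hypothetical. The call is left commented out because it downloads the model weights.

```python
MODEL_ID = "taran1812/agentguard-decision-distilbert"  # assumed Hub repository id

EXAMPLE_INPUT = """USER_REQUEST:
Refund $8,500 to customer ACCT-9012 due to a billing dispute.

PROPOSED_TOOL:
issue_refund

TOOL_ARGUMENTS:
{"amount": 8500, "customer_id": "ACCT-9012"}

POLICY_CONTEXT:
Refunds above $1,000 require manager approval."""


def classify(text, model_id=MODEL_ID):
    """Return (label, confidence) for one formatted guardrail input."""
    # Imported lazily so the module can be loaded without torch/transformers installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # DistilBERT accepts at most 512 tokens, so longer inputs are truncated.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    probs = torch.softmax(logits, dim=-1)[0]
    best = int(torch.argmax(probs))
    return model.config.id2label[best], float(probs[best])


# Usage (downloads the checkpoint on first call):
# label, confidence = classify(EXAMPLE_INPUT)
# print(label, round(confidence, 3))
```

Because long policy contexts may exceed the 512-token limit, truncation could silently drop the policy text at the end of the input. Checking token counts before classification is a reasonable precaution.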

Limitations

  • Trained on synthetic policy/tool data, not on audited production data.
  • Should not be used as a substitute for hard authorization, IAM, or compliance checks.
  • REWRITE_REQUIRED is the most semantically ambiguous class and has the lowest per-class F1 (0.66).
  • Performance outside the demonstrated policy/tool format has not been established.

Recommended Deployment Pattern

Use this model together with the companion AgentGuard risk classifier and a deterministic decision layer that applies:

  • confidence thresholds
  • human escalation rules
  • high-risk overrides
  • downstream authorization enforcement
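
The deterministic layer described above could look roughly like the following sketch. Everything specific in it is an assumption for illustration: the 0.80 threshold, the high-risk tool names, the `risk_level` values expected from the companion risk classifier, and the `caller_authorized` flag standing in for a real IAM check. None of these are taken from the AgentGuard implementation.

```python
from dataclasses import dataclass

LABELS = ("ALLOW", "BLOCK", "ESCALATE_TO_HUMAN", "REWRITE_REQUIRED")


@dataclass(frozen=True)
class GuardPolicy:
    min_confidence: float = 0.80  # hypothetical threshold
    high_risk_tools: frozenset = frozenset({"issue_refund", "delete_user"})  # hypothetical


def decide(predicted_label, confidence, proposed_tool, risk_level,
           caller_authorized, policy=GuardPolicy()):
    """Combine the classifier output with fixed rules; the rules always win."""
    if predicted_label not in LABELS:
        raise ValueError(f"unknown label: {predicted_label}")

    # Downstream authorization is enforced independently of the model.
    if not caller_authorized:
        return "BLOCK"
    # A BLOCK prediction is never relaxed, whatever its confidence.
    if predicted_label == "BLOCK":
        return "BLOCK"
    # High-risk override: the model alone cannot ALLOW a high-risk action.
    if predicted_label == "ALLOW" and (
        risk_level == "high" or proposed_tool in policy.high_risk_tools
    ):
        return "ESCALATE_TO_HUMAN"
    # Confidence threshold: uncertain predictions go to a human.
    if confidence < policy.min_confidence:
        return "ESCALATE_TO_HUMAN"
    return predicted_label


print(decide("ALLOW", 0.97, "issue_refund", "low", caller_authorized=True))
```

The rules are ordered from most to least restrictive, so a later rule can never loosen a decision made by an earlier one.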

Citation / Attribution

This model was built as part of the AgentGuard project by Taran Patel.
