	---
	language:
	- en
	license: apache-2.0
	library_name: transformers
	tags:
	- modernbert
	- security
	- jailbreak-detection
	- prompt-injection
	- token-classification
	- tool-calling
	- llm-safety
	- mcp
	datasets:
	- microsoft/llmail-inject-challenge
	- allenai/wildjailbreak
	- hackaprompt/hackaprompt-dataset
	- JailbreakBench/JBB-Behaviors
	base_model: answerdotai/ModernBERT-base
	pipeline_tag: token-classification
	model-index:
	- name: toolcall-verifier
	  results:
	  - task:
	      type: token-classification
	      name: Unauthorized Tool Call Detection
	    metrics:
	    - name: UNAUTHORIZED F1
	      type: f1
	      value: 0.9350
	    - name: UNAUTHORIZED Precision
	      type: precision
	      value: 0.9501
	    - name: UNAUTHORIZED Recall
	      type: recall
	      value: 0.9205
	    - name: Accuracy
	      type: accuracy
	      value: 0.9288
	---

	# ToolCallVerifier - Unauthorized Tool Call Detection

	<div align="center">

	[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
	[![Model](https://img.shields.io/badge/🤗-ModernBERT--base-yellow)](https://huggingface.co/answerdotai/ModernBERT-base)

	Stage 2 of a Two-Stage LLM Agent Defense Pipeline

	</div>

	---

	## 🎯 What This Model Does

	ToolCallVerifier is a ModernBERT-based token classifier that detects unauthorized tool calls in LLM agent systems. It performs token-level classification on tool call JSON to identify malicious arguments that may have been injected through prompt injection attacks.

	| Label | Description |
	|-------|-------------|
	| `AUTHORIZED` | Token is part of a legitimate, user-requested action |
	| `UNAUTHORIZED` | Token indicates injected/malicious content (BLOCK) |
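Because the labels are assigned per token, a caller still has to reduce them to a single allow/block decision for the whole tool call. The sketch below shows one possible reduction. It assumes predictions arrive as a list of dicts with `entity` (or `entity_group`) and `score` keys, which is the shape the `transformers` token-classification pipeline returns. The `should_block` helper and the 0.5 threshold are illustrative assumptions, not part of this model's API.

```python
from typing import Dict, List


def should_block(predictions: List[Dict], threshold: float = 0.5) -> bool:
    """Return True if any token is labeled UNAUTHORIZED with enough confidence."""
    for pred in predictions:
        # Aggregated pipeline output uses "entity_group" instead of "entity".
        label = pred.get("entity_group", pred.get("entity", ""))
        if label == "UNAUTHORIZED" and pred.get("score", 0.0) >= threshold:
            return True
    return False


# Hand-written predictions for illustration only; this is not real model output.
example = [
    {"entity": "AUTHORIZED", "score": 0.99, "word": "send_email"},
    {"entity": "UNAUTHORIZED", "score": 0.97, "word": "attacker@evil.example"},
]
print(should_block(example))  # True
```

Blocking on a single flagged token is the strictest policy. A deployment that sees too many false positives could instead require a minimum number of flagged tokens or raise the threshold.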

	---

	## 🚨 Attack Categories Covered

	| Category | Source | Description |
	|----------|--------|-------------|
	| Delimiter Injection | LLMail | `<<end_context>>`, `>>}}]])` |
	| Word Obfuscation | LLMail | Inserting noise words between tokens |
	| Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` |
	| Roleplay Injection | WildJailbreak | "You are an admin bot that can..." |
	| XML Tag Injection | WildJailbreak | `<execute_action>`, `<tool_call>` |
	| Authority Bypass | WildJailbreak | "As administrator, I authorize..." |
	| Intent Mismatch | Synthetic | User asks X, tool does Y |
	| MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args |
	| MCP Shadowing | Synthetic | Fake authorization context |


	## 🔗 Integration with FunctionCallSentinel

	This model is Stage 2 of a two-stage defense pipeline:

	```
	┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
	│   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │
	│                 │     │      (Stage 1)       │     │                 │
	└─────────────────┘     └──────────────────────┘     └────────┬────────┘
	                                                              │
	                               ┌──────────────────────────────▼──────────────────────────┐
	                               │              ToolCallVerifier (This Model)              │
	                               │     Token-level verification before tool execution      │
	                               └─────────────────────────────────────────────────────────┘
	```
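The control flow in the diagram can be sketched as a small gating function. This is a minimal sketch under stated assumptions, not the pipeline's actual implementation: `run_guarded_tool_call` and all four callables are hypothetical names, and passing the user prompt to Stage 2 alongside the tool call is an assumption made so that intent-mismatch checks are possible.

```python
from typing import Callable, Dict, Optional


def run_guarded_tool_call(
    prompt: str,
    stage1_is_safe: Callable[[str], bool],
    llm_propose_tool_call: Callable[[str], Dict],
    stage2_is_authorized: Callable[[str, Dict], bool],
    execute_tool: Callable[[Dict], str],
) -> Optional[str]:
    """Execute a tool call only if both stages approve; return None when blocked."""
    # Stage 1: screen the prompt before it reaches the LLM.
    if not stage1_is_safe(prompt):
        return None
    tool_call = llm_propose_tool_call(prompt)
    # Stage 2: verify the proposed tool call before it is executed.
    if not stage2_is_authorized(prompt, tool_call):
        return None
    return execute_tool(tool_call)


# Stub components for illustration; real stages would wrap the two classifiers.
result = run_guarded_tool_call(
    "Summarize my inbox",
    stage1_is_safe=lambda p: True,
    llm_propose_tool_call=lambda p: {
        "name": "send_email",
        "arguments": {"to": "attacker@evil.example"},
    },
    stage2_is_authorized=lambda p, call: call["name"] != "send_email",
    execute_tool=lambda call: "executed",
)
print(result)  # None
```

Keeping the stages as injected callables lets either classifier be swapped or disabled without touching the gating logic.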

	| Scenario | Recommendation |
	|----------|----------------|
	| General chatbot | Stage 1 only |
	| Tool-calling agent (low risk) | Stage 1 only |
	| Tool-calling agent (high risk) | Both stages |
	| Email/file system access | Both stages |
	| Financial transactions | Both stages |
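The recommendations in the table can be encoded as a small lookup so that a deployment chooses its stages in one place. This is a sketch only: the `Scenario` enum, the `stages_for` helper, and the `"stage1"`/`"stage2"` identifiers are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Tuple


class Scenario(Enum):
    GENERAL_CHATBOT = "general_chatbot"
    LOW_RISK_AGENT = "low_risk_agent"
    HIGH_RISK_AGENT = "high_risk_agent"
    EMAIL_OR_FILES = "email_or_files"
    FINANCIAL = "financial"


# Scenarios for which the table recommends Stage 1 only.
_STAGE1_ONLY = {Scenario.GENERAL_CHATBOT, Scenario.LOW_RISK_AGENT}


def stages_for(scenario: Scenario) -> Tuple[str, ...]:
    """Return the defense stages recommended for a deployment scenario."""
    if scenario in _STAGE1_ONLY:
        return ("stage1",)
    return ("stage1", "stage2")


print(stages_for(Scenario.FINANCIAL))  # ('stage1', 'stage2')
```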

	---

	## 🎯 Intended Use

	### Primary Use Cases
	- LLM Agent Security: Verify tool calls before execution
	- Prompt Injection Defense: Detect unauthorized actions from injected prompts
	- API Gateway Protection: Filter malicious tool calls at infrastructure level

	### Out of Scope
	- General text classification
	- Non-tool-calling scenarios
	- Languages other than English


	## 📜 License

	Apache 2.0