| | --- |
| | language: |
| | - en |
| | license: apache-2.0 |
| | library_name: transformers |
| | tags: |
| | - modernbert |
| | - security |
| | - jailbreak-detection |
| | - prompt-injection |
| | - token-classification |
| | - tool-calling |
| | - llm-safety |
| | - mcp |
| | datasets: |
| | - microsoft/llmail-inject-challenge |
| | - allenai/wildjailbreak |
| | - hackaprompt/hackaprompt-dataset |
| | - JailbreakBench/JBB-Behaviors |
| | base_model: answerdotai/ModernBERT-base |
| | pipeline_tag: token-classification |
| | model-index: |
| | - name: toolcall-verifier |
| | results: |
| | - task: |
| | type: token-classification |
| | name: Unauthorized Tool Call Detection |
| | metrics: |
| | - name: UNAUTHORIZED F1 |
| | type: f1 |
| | value: 0.9350 |
| | - name: UNAUTHORIZED Precision |
| | type: precision |
| | value: 0.9501 |
| | - name: UNAUTHORIZED Recall |
| | type: recall |
| | value: 0.9205 |
| | - name: Accuracy |
| | type: accuracy |
| | value: 0.9288 |
| | --- |
| | |
| | # ToolCallVerifier - Unauthorized Tool Call Detection |
| |
|
| | <div align="center"> |
| |
|
| | [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) |
| | [Base model: ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) |
| |
|
| | **Stage 2 of a Two-Stage LLM Agent Defense Pipeline** |
| |
|
| | </div> |
| |
|
| | --- |
| |
|
| | ## 🎯 What This Model Does |
| |
|
| | ToolCallVerifier is a **ModernBERT-based token classifier** that detects unauthorized tool calls in LLM agent systems. It classifies each token of the tool-call JSON to pinpoint arguments that a prompt injection attack may have planted. |
| |
|
| | | Label | Description | |
| | |-------|-------------| |
| | | `AUTHORIZED` | Token is part of a legitimate, user-requested action | |
| | | `UNAUTHORIZED` | Token indicates injected/malicious content → **BLOCK** | |
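
As a rough illustration of how the two labels might drive a block decision, the sketch below collects the `UNAUTHORIZED` spans from a set of predictions and blocks the call if any of them clears a score threshold. The dictionary keys (`entity_group`, `score`, `start`, `end`) follow the output of the `transformers` token-classification pipeline with `aggregation_strategy="simple"`. The 0.5 threshold, the helper names, the `model_id` argument, and the idea of feeding the raw tool-call JSON as input are assumptions made for this example, not documented behavior of the model.

```python
import json

BLOCK_LABEL = "UNAUTHORIZED"


def unauthorized_spans(predictions, text, threshold=0.5):
    """Return (substring, score) pairs for spans labelled UNAUTHORIZED.

    `predictions` is a list of dicts shaped like the output of a
    transformers token-classification pipeline run with
    aggregation_strategy="simple".
    """
    spans = []
    for pred in predictions:
        if pred["entity_group"] == BLOCK_LABEL and pred["score"] >= threshold:
            spans.append((text[pred["start"]:pred["end"]], float(pred["score"])))
    return spans


def should_block(predictions, text, threshold=0.5):
    """Block the tool call if any span is flagged at or above the threshold."""
    return len(unauthorized_spans(predictions, text, threshold)) > 0


def classify_tool_call(tool_call_json, model_id):
    """Run the classifier on a serialized tool call.

    Assumption: `model_id` is the Hub ID or local path of this model, and the
    raw tool-call JSON is an acceptable input. Verify both before relying on it.
    """
    from transformers import pipeline  # lazy import; requires `transformers`

    classifier = pipeline(
        "token-classification", model=model_id, aggregation_strategy="simple"
    )
    return classifier(tool_call_json)


# Hand-written predictions standing in for real model output.
tool_call = json.dumps(
    {"name": "send_email", "arguments": {"to": "attacker@example.com"}}
)
start = tool_call.index("attacker@example.com")
fake_predictions = [
    {"entity_group": "AUTHORIZED", "score": 0.99, "start": 0, "end": start},
    {
        "entity_group": "UNAUTHORIZED",
        "score": 0.97,
        "start": start,
        "end": start + len("attacker@example.com"),
    },
]
print(should_block(fake_predictions, tool_call))
```

Blocking on a single flagged span is the most conservative policy; a deployment that tolerates more risk could instead require a minimum number of flagged characters or a higher threshold.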
| |
|
| | --- |
| |
|
| | ## 🚨 Attack Categories Covered |
| |
|
| | | Category | Source | Description | |
| | |----------|--------|-------------| |
| | | Delimiter Injection | LLMail | `<<end_context>>`, `>>}}]])` | |
| | | Word Obfuscation | LLMail | Inserting noise words between tokens | |
| | | Fake Sessions | LLMail | `START_USER_SESSION`, `EXECUTE_USERQUERY` | |
| | | Roleplay Injection | WildJailbreak | "You are an admin bot that can..." | |
| | | XML Tag Injection | WildJailbreak | `<execute_action>`, `<tool_call>` | |
| | | Authority Bypass | WildJailbreak | "As administrator, I authorize..." | |
| | | Intent Mismatch | Synthetic | User asks X, tool does Y | |
| | | MCP Tool Poisoning | Synthetic | Hidden exfiltration in tool args | |
| | | MCP Shadowing | Synthetic | Fake authorization context | |
| |
|
| |
|
| | ## Integration with FunctionCallSentinel |
| |
|
| | This model is **Stage 2** of a two-stage defense pipeline: |
| |
|
| | ``` |
| | ┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐ |
| | │   User Prompt   │────▶│ FunctionCallSentinel │────▶│   LLM + Tools   │ |
| | │                 │     │      (Stage 1)       │     │                 │ |
| | └─────────────────┘     └──────────────────────┘     └────────┬────────┘ |
| |                                                               │ |
| |               ┌───────────────────────────────────────────────▼────────┐ |
| |               │             ToolCallVerifier (This Model)              │ |
| |               │     Token-level verification before tool execution     │ |
| |               └────────────────────────────────────────────────────────┘ |
| | ``` |
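
The flow in the diagram can be sketched as a small gate around tool execution. This is a minimal sketch under stated assumptions: every callable here (`stage1_is_safe`, `plan_tool_call`, `stage2_is_authorized`, `execute_tool`) is a hypothetical stand-in supplied by the caller, and passing the user prompt to Stage 2 alongside the tool call is an assumption about the input format rather than something this card specifies.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Decision:
    allowed: bool
    blocked_by: Optional[str] = None  # "stage1", "stage2", or None
    result: Optional[Any] = None


def run_guarded(
    user_prompt: str,
    stage1_is_safe: Callable[[str], bool],
    plan_tool_call: Callable[[str], Dict[str, Any]],
    stage2_is_authorized: Callable[[str, Dict[str, Any]], bool],
    execute_tool: Callable[[Dict[str, Any]], Any],
) -> Decision:
    """Stage 1 screens the prompt; Stage 2 verifies the tool call before it runs."""
    if not stage1_is_safe(user_prompt):
        return Decision(allowed=False, blocked_by="stage1")
    tool_call = plan_tool_call(user_prompt)  # the LLM proposes a tool call
    if not stage2_is_authorized(user_prompt, tool_call):
        return Decision(allowed=False, blocked_by="stage2")
    return Decision(allowed=True, result=execute_tool(tool_call))


# Stub components, for illustration only.
decision = run_guarded(
    "Summarize my inbox",
    stage1_is_safe=lambda prompt: True,
    plan_tool_call=lambda prompt: {
        "name": "send_email",
        "arguments": {"to": "attacker@example.com"},
    },
    stage2_is_authorized=lambda prompt, call: call["name"] != "send_email",
    execute_tool=lambda call: "executed",
)
print(decision.blocked_by)
```

Keeping the stages as injected callables means the tool is never executed unless both checks pass, and either stage can be swapped or disabled without touching the gate itself.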
| |
|
| | | Scenario | Recommendation | |
| | |----------|----------------| |
| | | General chatbot | Stage 1 only | |
| | | Tool-calling agent (low risk) | Stage 1 only | |
| | | Tool-calling agent (high risk) | **Both stages** | |
| | | Email/file system access | **Both stages** | |
| | | Financial transactions | **Both stages** | |
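
One way to encode the recommendations above is a small policy function that decides which stages to run for a given agent. The capability names (`email`, `file_system`, `financial`) and the `stages_for` helper are hypothetical and chosen only to mirror the table rows.

```python
from typing import List, Set

# Hypothetical capability tags mirroring the high-risk rows of the table.
HIGH_RISK_CAPABILITIES = {"email", "file_system", "financial"}


def stages_for(
    uses_tools: bool, capabilities: Set[str], high_risk: bool = False
) -> List[str]:
    """Return the defense stages to enable for an agent configuration."""
    if uses_tools and (high_risk or capabilities & HIGH_RISK_CAPABILITIES):
        return ["stage1", "stage2"]
    return ["stage1"]


print(stages_for(uses_tools=False, capabilities=set()))
print(stages_for(uses_tools=True, capabilities={"email"}))
```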
| |
|
| | --- |
| |
|
| | ## 🎯 Intended Use |
| |
|
| | ### Primary Use Cases |
| | - **LLM Agent Security**: Verify tool calls before execution |
| | - **Prompt Injection Defense**: Detect unauthorized actions from injected prompts |
| | - **API Gateway Protection**: Filter malicious tool calls at infrastructure level |
| |
|
| | ### Out of Scope |
| | - General text classification |
| | - Non-tool-calling scenarios |
| | - Languages other than English |
| |
|
| |
|
| | ## License |
| |
|
| | Apache 2.0 |
| |
|
| |
|