---
license: mit
language:
- en
tags:
- text-classification
- mcp
- tool-calling
- qa-testing
- grok
- error-detection
datasets:
- brijeshvadi/mcp-tool-calling-benchmark
metrics:
- accuracy
- f1
pipeline_tag: text-classification
model-index:
- name: mcp-error-classifier
  results:
  - task:
      type: text-classification
      name: MCP Error Classification
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.923
    - name: F1
      type: f1
      value: 0.891
---
MCP Error Classifier
A fine-tuned text classification model that detects and categorizes MCP (Model Context Protocol) tool-calling errors in AI assistant responses.
Model Description
This model assigns each AI assistant tool-calling interaction one of six labels: CORRECT, plus five error categories identified during QA testing of Grok's MCP connector integrations:
| Label | Description | Training Samples |
|---|---|---|
| CORRECT | Tool invoked correctly with proper parameters | 2,847 |
| TOOL_BYPASS | Model answered from training data instead of invoking the tool | 1,203 |
| FALSE_SUCCESS | Model claimed success but tool was never called | 892 |
| HALLUCINATION | Model fabricated tool response data | 756 |
| BROKEN_CHAIN | Multi-step workflow failed mid-chain | 441 |
| STALE_DATA | Tool called but returned outdated cached results | 312 |
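The label distribution in the table is imbalanced: CORRECT makes up roughly 44% of the samples, while STALE_DATA is under 5%. The sketch below computes each label's share and one common set of inverse-frequency class weights from the table counts. It is only an illustration; this card does not state whether class weighting was used during training, and the function names are hypothetical.

```python
# Per-label sample counts copied from the table above.
LABEL_COUNTS = {
    "CORRECT": 2847,
    "TOOL_BYPASS": 1203,
    "FALSE_SUCCESS": 892,
    "HALLUCINATION": 756,
    "BROKEN_CHAIN": 441,
    "STALE_DATA": 312,
}


def class_shares(counts):
    """Return each label's fraction of the total sample count."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def inverse_frequency_weights(counts):
    """Return weights of the form total / (num_classes * count).

    Assumption: this is a generic balancing heuristic, not a setting
    documented for this model.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


if __name__ == "__main__":
    shares = class_shares(LABEL_COUNTS)
    weights = inverse_frequency_weights(LABEL_COUNTS)
    for label in LABEL_COUNTS:
        print(f"{label:14s} share={shares[label]:.3f} weight={weights[label]:.2f}")
```

Rare labels such as STALE_DATA receive the largest weights, which is why per-class F1 can be more informative than accuracy on data shaped like this.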
Training Details
- Base Model: distilbert-base-uncased
- Training Data: 6,451 labeled MCP interaction logs across 12 platforms
- Platforms Tested: Supabase, Notion, Miro, Vercel, Netlify, Canva, Linear, GitHub, Box, Slack, Google Drive, Jotform
- Epochs: 5
- Learning Rate: 2e-5
- Batch Size: 32
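To give a sense of the training budget these hyperparameters imply, the following sketch works out the number of optimizer steps. It assumes all 6,451 samples were used for training, with no held-out split, no gradient accumulation, and a single device; none of these details are stated on this card.

```python
import math

# Values taken from the Training Details list above.
TRAIN_SAMPLES = 6451
BATCH_SIZE = 32
EPOCHS = 5


def optimizer_steps(n_samples, batch_size, epochs):
    """Return (steps_per_epoch, total_steps), keeping the final partial batch."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


if __name__ == "__main__":
    per_epoch, total = optimizer_steps(TRAIN_SAMPLES, BATCH_SIZE, EPOCHS)
    print(f"{per_epoch} steps per epoch, {total} steps in total")
```

Under those assumptions the run is 202 steps per epoch and 1,010 steps overall.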
Usage
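```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline("text-classification", model="brijeshvadi/mcp-error-classifier")

# Describe the interaction, including the expected tool and what actually happened.
result = classifier("Grok responded with project details but never called the Supabase list_projects tool")
print(result)
# Example output: [{'label': 'TOOL_BYPASS', 'score': 0.94}]
```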
Intended Use
- QA evaluation of AI assistants' MCP tool-calling reliability
- Automated error categorization in MCP testing pipelines
- Benchmarking tool-use accuracy across different LLM providers
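For the benchmarking use case, predicted labels can be aggregated into a per-provider error rate. The sketch below assumes you have already collected `(provider, label)` pairs, for example by running the classifier over each provider's logs. The function name and the provider names are hypothetical.

```python
from collections import defaultdict


def error_rates(records):
    """Return the fraction of non-CORRECT labels per provider.

    records: iterable of (provider, label) pairs, where label is one of
    the six labels listed in the Model Description table.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for provider, label in records:
        totals[provider] += 1
        if label != "CORRECT":
            errors[provider] += 1
    return {provider: errors[provider] / totals[provider] for provider in totals}


if __name__ == "__main__":
    # Made-up records for illustration only; not benchmark results.
    sample = [
        ("provider_a", "CORRECT"),
        ("provider_a", "TOOL_BYPASS"),
        ("provider_b", "CORRECT"),
        ("provider_b", "CORRECT"),
    ]
    print(error_rates(sample))
```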
Limitations
- Trained primarily on Grok interaction logs; may underperform on interaction patterns from other assistants such as Claude or ChatGPT
- English only
- Input text must state which tool was expected and which tool, if any, was actually called
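Because the classifier needs to know the expected tool and the observed calls, it can help to build the input text in a consistent way. The helper below is a hypothetical sketch: the sentence template is an assumption, since this card does not document the exact input format used in training, so check the results against your own labeled examples.

```python
def describe_interaction(expected_tool, called_tools, response_summary):
    """Build a one-line description of an interaction for the classifier.

    Assumption: this template is illustrative only and is not the
    documented training format.
    """
    called = ", ".join(called_tools) if called_tools else "none"
    return (
        f"Expected tool: {expected_tool}. "
        f"Tools called: {called}. "
        f"Response: {response_summary}"
    )


if __name__ == "__main__":
    text = describe_interaction(
        expected_tool="list_projects",
        called_tools=[],
        response_summary="Assistant listed project details.",
    )
    print(text)
```

The resulting string can be passed to the pipeline in the same way as the sentence in the Usage section.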
Citation
```bibtex
@misc{mcp-error-classifier-2026,
  author = {Brijesh Vadi},
  title = {MCP Error Classifier: Detecting Tool-Calling Failures in AI Assistants},
  year = {2026},
  publisher = {Hugging Face},
}
```