---
license: mit
language:
- en
tags:
- text-classification
- mcp
- tool-calling
- qa-testing
- grok
- error-detection
datasets:
- brijeshvadi/mcp-tool-calling-benchmark
metrics:
- accuracy
- f1
pipeline_tag: text-classification
model-index:
- name: mcp-error-classifier
  results:
  - task:
      type: text-classification
      name: MCP Error Classification
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.923
    - name: F1
      type: f1
      value: 0.891
---

# MCP Error Classifier

A fine-tuned text classification model that detects and categorizes MCP (Model Context Protocol) tool-calling errors in AI assistant responses.

## Model Description

This model classifies AI assistant tool-calling behavior into six labels: `CORRECT` plus five error categories identified during QA testing of Grok's MCP connector integrations.

| Label | Description | Training Samples |
|-------|-------------|------------------|
| `CORRECT` | Tool invoked correctly with proper parameters | 2,847 |
| `TOOL_BYPASS` | Model answered from training data instead of invoking the tool | 1,203 |
| `FALSE_SUCCESS` | Model claimed success but tool was never called | 892 |
| `HALLUCINATION` | Model fabricated tool response data | 756 |
| `BROKEN_CHAIN` | Multi-step workflow failed mid-chain | 441 |
| `STALE_DATA` | Tool called but returned outdated cached results | 312 |

## Training Details

- **Base Model:** `distilbert-base-uncased`
- **Training Data:** 6,451 labeled MCP interaction logs across 12 platforms
- **Platforms Tested:** Supabase, Notion, Miro, Vercel, Netlify, Canva, Linear, GitHub, Box, Slack, Google Drive, Jotform
- **Epochs:** 5
- **Learning Rate:** 2e-5
- **Batch Size:** 32

## Usage

```python
from transformers import pipeline

# Load the classifier from the Hugging Face Hub.
classifier = pipeline("text-classification", model="brijeshvadi/mcp-error-classifier")

# Describe the observed behavior, including the tool that was expected.
result = classifier("Grok responded with project details but never called the Supabase list_projects tool")
print(result)
# Output: [{'label': 'TOOL_BYPASS', 'score': 0.94}]
```

## Intended Use

- QA evaluation of AI assistants' MCP tool-calling reliability
- Automated error categorization in MCP testing pipelines
- Benchmarking tool-use accuracy across different LLM providers

## Limitations

- Trained primarily on Grok interaction logs; may underperform on interaction patterns from other assistants such as Claude or ChatGPT
- English only
- Requires input that states which tool was expected and what was actually called

## Citation

```bibtex
@misc{mcp-error-classifier-2026,
  author = {Brijesh Vadi},
  title = {MCP Error Classifier: Detecting Tool-Calling Failures in AI Assistants},
  year = {2026},
  publisher = {Hugging Face},
}
```
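
## Example: Summarizing Predictions

For the automated error categorization use case listed under Intended Use, the classifier's outputs usually need to be aggregated into per-label counts. The following is a minimal sketch, not part of the released model: the function name `summarize_predictions`, the `min_score` threshold of 0.5, and the "needs review" bucket are all hypothetical choices for illustration. It assumes predictions arrive as `{'label': ..., 'score': ...}` dictionaries, the shape shown in the Usage example.

```python
# Label names taken from the label table in this model card.
LABELS = [
    "CORRECT",
    "TOOL_BYPASS",
    "FALSE_SUCCESS",
    "HALLUCINATION",
    "BROKEN_CHAIN",
    "STALE_DATA",
]


def summarize_predictions(predictions, min_score=0.5):
    """Aggregate pipeline-style predictions into per-label counts.

    `min_score` is a hypothetical confidence cutoff: predictions below it,
    or with a label not in LABELS, are counted as needing manual review
    instead of being assigned to a category.
    """
    counts = {label: 0 for label in LABELS}
    needs_review = 0

    for pred in predictions:
        label, score = pred["label"], pred["score"]
        if score < min_score or label not in counts:
            needs_review += 1
        else:
            counts[label] += 1

    confident_total = sum(counts.values())
    errors = confident_total - counts["CORRECT"]
    # Error rate over confidently classified samples only; 0.0 if there are none.
    error_rate = errors / confident_total if confident_total else 0.0

    return {
        "counts": counts,
        "needs_review": needs_review,
        "error_rate": error_rate,
    }


if __name__ == "__main__":
    # Hand-written sample predictions, for illustration only (not real model output).
    sample = [
        {"label": "CORRECT", "score": 0.97},
        {"label": "TOOL_BYPASS", "score": 0.94},
        {"label": "HALLUCINATION", "score": 0.41},
    ]
    print(summarize_predictions(sample))
```

Routing low-confidence predictions to a review bucket keeps uncertain cases out of the error-rate figure; whether 0.5 is a sensible cutoff would need to be checked against held-out data.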