---
license: mit
language:
- en
tags:
- text-classification
- mcp
- tool-calling
- qa-testing
- grok
- error-detection
datasets:
- brijeshvadi/mcp-tool-calling-benchmark
metrics:
- accuracy
- f1
pipeline_tag: text-classification
model-index:
- name: mcp-error-classifier
  results:
  - task:
      type: text-classification
      name: MCP Error Classification
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.923
    - name: F1
      type: f1
      value: 0.891
---

# MCP Error Classifier
|
|
A fine-tuned text classification model that detects and categorizes MCP (Model Context Protocol) tool-calling errors in AI assistant responses.
|
|
## Model Description
|
|
This model classifies AI assistant tool-calling behavior into six labels — one correct class and five error categories — identified during QA testing of Grok's MCP connector integrations:
|
|
| | Label | Description | Training Samples | |
| |-------|-------------|-----------------| |
| | `CORRECT` | Tool invoked correctly with proper parameters | 2,847 | |
| | `TOOL_BYPASS` | Model answered from training data instead of invoking the tool | 1,203 | |
| | `FALSE_SUCCESS` | Model claimed success but tool was never called | 892 | |
| | `HALLUCINATION` | Model fabricated tool response data | 756 | |
| | `BROKEN_CHAIN` | Multi-step workflow failed mid-chain | 441 | |
| | `STALE_DATA` | Tool called but returned outdated cached results | 312 | |
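For downstream tooling, the six labels above can be captured as a small enum. This is a sketch, assuming the label strings match the model's `id2label` configuration exactly:

```python
from enum import Enum

class MCPLabel(str, Enum):
    """Labels emitted by the classifier, per the table above."""
    CORRECT = "CORRECT"
    TOOL_BYPASS = "TOOL_BYPASS"
    FALSE_SUCCESS = "FALSE_SUCCESS"
    HALLUCINATION = "HALLUCINATION"
    BROKEN_CHAIN = "BROKEN_CHAIN"
    STALE_DATA = "STALE_DATA"

# Every label except CORRECT represents a failure mode.
ERROR_LABELS = {label for label in MCPLabel if label is not MCPLabel.CORRECT}

def is_error(label: str) -> bool:
    """True if a predicted label string denotes a tool-calling failure."""
    return label in {l.value for l in ERROR_LABELS}
```

Keeping the label set in one place makes it easy to flag failures uniformly across QA scripts, e.g. `is_error(result[0]["label"])`.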
|
|
## Training Details
|
|
- **Base Model:** `distilbert-base-uncased`
- **Training Data:** 6,451 labeled MCP interaction logs across 12 platforms
- **Platforms Tested:** Supabase, Notion, Miro, Vercel, Netlify, Canva, Linear, GitHub, Box, Slack, Google Drive, Jotform
- **Epochs:** 5
- **Learning Rate:** 2e-5
- **Batch Size:** 32
|
|
## Usage
|
|
```python
from transformers import pipeline

classifier = pipeline("text-classification", model="brijeshvadi/mcp-error-classifier")

result = classifier("Grok responded with project details but never called the Supabase list_projects tool")
# Output: [{'label': 'TOOL_BYPASS', 'score': 0.94}]
```
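Because the model works best when the input states which tool was expected versus what was actually called (see Limitations), it can help to build the classifier input from structured log fields. The template and field names below are an illustrative convention, not a format this model card mandates:

```python
from typing import Optional

def format_interaction(assistant_text: str,
                       expected_tool: str,
                       called_tool: Optional[str]) -> str:
    """Flatten one logged MCP interaction into a single classifier input.

    The template is a hypothetical convention; adapt it to however
    your own interaction logs are structured.
    """
    called = called_tool if called_tool else "no tool"
    return (
        f"Expected tool: {expected_tool}. "
        f"Actually called: {called}. "
        f"Assistant response: {assistant_text}"
    )

text = format_interaction(
    "Here are your Supabase projects: alpha, beta.",
    expected_tool="supabase.list_projects",
    called_tool=None,
)
```

The resulting string can then be passed straight to `classifier(text)`.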
|
|
## Intended Use
|
|
- QA evaluation of AI assistants' MCP tool-calling reliability
- Automated error categorization in MCP testing pipelines
- Benchmarking tool-use accuracy across different LLM providers
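As a sketch of the second use case, predicted labels from a test run can be rolled up into a per-label breakdown and an overall error rate. The label strings are assumed to match the table above:

```python
from collections import Counter

def summarize_run(predicted_labels: list) -> dict:
    """Aggregate classifier outputs from one QA run into summary stats."""
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    # Everything except CORRECT counts as a tool-calling failure.
    errors = total - counts.get("CORRECT", 0)
    return {
        "total": total,
        "error_rate": errors / total if total else 0.0,
        "by_label": dict(counts),
    }

summary = summarize_run(
    ["CORRECT", "TOOL_BYPASS", "CORRECT", "FALSE_SUCCESS", "CORRECT"]
)
# summary["error_rate"] == 0.4
```

A report like this per platform makes regressions in connector reliability easy to spot between test runs.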
|
|
## Limitations
|
|
- Trained primarily on Grok interaction logs; may underperform on Claude/ChatGPT patterns
- English only
- Requires context about which tool was expected vs. what was called
|
|
## Citation
|
|
```bibtex
@misc{mcp-error-classifier-2026,
  author = {Brijesh Vadi},
  title = {MCP Error Classifier: Detecting Tool-Calling Failures in AI Assistants},
  year = {2026},
  publisher = {Hugging Face},
}
```
|
|