brijeshvadi committed
Commit 4f84aa7 · verified · 1 Parent(s): 53cc896

Upload README.md with huggingface_hub

Files changed (1): README.md added (+91, -0)

---
license: mit
language:
- en
tags:
- text-classification
- mcp
- tool-calling
- qa-testing
- grok
- error-detection
datasets:
- brijeshvadi/mcp-tool-calling-benchmark
metrics:
- accuracy
- f1
pipeline_tag: text-classification
model-index:
- name: mcp-error-classifier
  results:
  - task:
      type: text-classification
      name: MCP Error Classification
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.923
    - name: F1
      type: f1
      value: 0.891
---

# MCP Error Classifier

A fine-tuned text classification model that detects and categorizes MCP (Model Context Protocol) tool-calling errors in AI assistant responses.

## Model Description

This model classifies AI assistant tool-calling behavior into six categories (one correct-behavior class and five error types) identified during QA testing of Grok's MCP connector integrations:

| Label | Description | Training Samples |
|-------|-------------|------------------|
| `CORRECT` | Tool invoked correctly with proper parameters | 2,847 |
| `TOOL_BYPASS` | Model answered from training data instead of invoking the tool | 1,203 |
| `FALSE_SUCCESS` | Model claimed success but the tool was never called | 892 |
| `HALLUCINATION` | Model fabricated tool response data | 756 |
| `BROKEN_CHAIN` | Multi-step workflow failed mid-chain | 441 |
| `STALE_DATA` | Tool called but returned outdated cached results | 312 |

## Training Details

- **Base Model:** `distilbert-base-uncased`
- **Training Data:** 6,451 labeled MCP interaction logs across 12 platforms
- **Platforms Tested:** Supabase, Notion, Miro, Vercel, Netlify, Canva, Linear, GitHub, Box, Slack, Google Drive, Jotform
- **Epochs:** 5
- **Learning Rate:** 2e-5
- **Batch Size:** 32

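The label distribution above is heavily imbalanced (2,847 `CORRECT` samples vs. 312 `STALE_DATA`). The card does not say how this was handled during fine-tuning; one common approach is inverse-frequency class weighting on the loss, sketched below. The weighting scheme itself is an assumption, not documented training behavior.

```python
# Hypothetical sketch: inverse-frequency class weights for the label
# counts in the table above. Whether class weighting was actually used
# during fine-tuning is not stated in this card.
counts = {
    "CORRECT": 2847,
    "TOOL_BYPASS": 1203,
    "FALSE_SUCCESS": 892,
    "HALLUCINATION": 756,
    "BROKEN_CHAIN": 441,
    "STALE_DATA": 312,
}

total = sum(counts.values())   # 6,451, matching the training-data figure
num_labels = len(counts)

# weight_i = total / (num_labels * count_i): rare classes get larger weights,
# so the loss is not dominated by the majority CORRECT class.
weights = {label: total / (num_labels * n) for label, n in counts.items()}
```

Weights like these can be passed to a weighted cross-entropy loss during fine-tuning.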
## Usage

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="brijeshvadi/mcp-error-classifier")

result = classifier("Grok responded with project details but never called the Supabase list_projects tool")
# Output: [{'label': 'TOOL_BYPASS', 'score': 0.94}]
```

## Intended Use

- QA evaluation of AI assistants' MCP tool-calling reliability
- Automated error categorization in MCP testing pipelines
- Benchmarking tool-use accuracy across different LLM providers

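For the QA and benchmarking uses above, per-interaction predictions are typically rolled up into an error profile for a test run. A minimal sketch, assuming the list-of-dicts output format shown in the Usage section (the prediction values here are hypothetical):

```python
from collections import Counter

# Hypothetical per-interaction predictions, in the same shape as the
# pipeline output shown in the Usage section.
predictions = [
    {"label": "TOOL_BYPASS", "score": 0.94},
    {"label": "CORRECT", "score": 0.88},
    {"label": "CORRECT", "score": 0.91},
    {"label": "FALSE_SUCCESS", "score": 0.77},
]

# Count each label to build an error profile for the run.
profile = Counter(p["label"] for p in predictions)

# Share of interactions where tool-calling behaved correctly.
correct_rate = profile["CORRECT"] / len(predictions)

print(profile.most_common())  # [('CORRECT', 2), ('TOOL_BYPASS', 1), ('FALSE_SUCCESS', 1)]
print(correct_rate)           # 0.5
```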
## Limitations

- Trained primarily on Grok interaction logs; may underperform on Claude/ChatGPT patterns
- English only
- Requires context about which tool was expected vs. what was called

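Because of the last limitation, callers must encode the expected tool into the classifier input themselves. The card does not document the input template used during training; the helper below is one hypothetical way to do it, and both the function name and the template format are assumptions.

```python
def build_classifier_input(expected_tool: str, response: str) -> str:
    """Combine the expected tool name and the assistant's response into a
    single text for classification.

    Hypothetical helper: the model card does not specify the input format
    the classifier was trained on, so this template is an assumption.
    """
    return f"Expected tool: {expected_tool}. Assistant response: {response}"


text = build_classifier_input(
    "supabase.list_projects",
    "Here are your projects: alpha, beta.",
)
```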
## Citation

```bibtex
@misc{mcp-error-classifier-2026,
  author    = {Brijesh Vadi},
  title     = {MCP Error Classifier: Detecting Tool-Calling Failures in AI Assistants},
  year      = {2026},
  publisher = {Hugging Face},
}
```