# Customer Support Ticket Classifier

A fine-tuned DistilBERT model that classifies customer support tickets into 11 issue categories. Designed for automatic routing, triage, and analytics of customer inquiries.

## Quick Start

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Janvi17/customer-support-ticket-classifier")

result = classifier("I want to cancel my subscription immediately")
print(result)
# [{'label': 'SUBSCRIPTION', 'score': 0.997}]
```
## Categories

The model classifies text into one of 11 customer support issue types:

| Label | Description | Example |
|---|---|---|
| ACCOUNT | Account creation, deletion, password, profile changes | "How do I change my account password?" |
| CANCEL | Cancellation fees, policies, contract termination | "What is the fee for canceling the contract?" |
| CONTACT | Reaching customer service, speaking to a human agent | "I want to speak to a human agent" |
| DELIVERY | Delivery options, shipping methods, delivery regions | "Do you ship to Hungary?" |
| FEEDBACK | Reviews, complaints, submitting feedback | "I'd like to leave a review for your services" |
| INVOICE | Viewing, requesting, or locating invoices/bills | "Can you send me the invoice for order #12345?" |
| ORDER | Placing, tracking, modifying, or canceling orders | "I need help cancelling order #55123" |
| PAYMENT | Payment methods, issues, checkout errors | "I get an error when I try to check out" |
| REFUND | Refund requests, refund policy, tracking refunds | "I need a refund for my last purchase" |
| SHIPPING | Shipping address changes, setup, modifications | "I need to update my shipping address" |
| SUBSCRIPTION | Newsletter signup, unsubscribe, subscription management | "Help me unsubscribe from your newsletter" |
## Usage Examples

### Basic Classification

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Janvi17/customer-support-ticket-classifier")

tickets = [
    "I want to cancel my subscription immediately",
    "Where is my package? I've been waiting for 2 weeks",
    "I need a refund for my last purchase",
    "How do I change my account password?",
    "I want to speak to a human agent",
    "Can you send me the invoice for order #12345?",
]

for ticket in tickets:
    result = classifier(ticket)
    print(f"[{result[0]['label']:>12s}] (conf: {result[0]['score']:.3f}) {ticket}")
```

Output:

```text
[SUBSCRIPTION] (conf: 0.997) I want to cancel my subscription immediately
[    DELIVERY] (conf: 0.997) Where is my package? I've been waiting for 2 weeks
[      REFUND] (conf: 0.999) I need a refund for my last purchase
[     ACCOUNT] (conf: 1.000) How do I change my account password?
[     CONTACT] (conf: 0.999) I want to speak to a human agent
[     INVOICE] (conf: 0.997) Can you send me the invoice for order #12345?
```
### Batch Classification with Confidence Scores

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="Janvi17/customer-support-ticket-classifier",
    top_k=3,  # return the top 3 predictions per input
)

# Passing a list always yields one prediction list per input
results = classifier(["The payment for my subscription failed"])
for pred in results[0]:
    print(f"{pred['label']:>14s}: {pred['score']:.4f}")
#        PAYMENT: 0.9661
#   SUBSCRIPTION: 0.0153
#          ORDER: 0.0068
```
### Using with PyTorch Directly

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Janvi17/customer-support-ticket-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "I need a refund for my last purchase"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
pred_id = probs.argmax().item()
print(f"Predicted: {model.config.id2label[pred_id]} ({probs[0][pred_id]:.4f})")
# Predicted: REFUND (0.9995)
```
### Production Usage with a Confidence Threshold

For production systems, reject low-confidence predictions:

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Janvi17/customer-support-ticket-classifier")

CONFIDENCE_THRESHOLD = 0.85

def classify_ticket(text: str) -> dict:
    result = classifier(text)[0]
    if result["score"] < CONFIDENCE_THRESHOLD:
        return {"label": "UNKNOWN", "score": result["score"], "routed_to": "human_review"}
    return {"label": result["label"], "score": result["score"], "routed_to": "auto"}

# High confidence → auto-routed
print(classify_ticket("I need a refund"))
# {'label': 'REFUND', 'score': 0.999, 'routed_to': 'auto'}

# Low confidence → human review
print(classify_ticket("asdfghjkl"))
# {'label': 'UNKNOWN', 'score': 0.78, 'routed_to': 'human_review'}
```
## Evaluation Results

### Held-Out Test Set (2,464 samples)
| Metric | Score |
|---|---|
| Accuracy | 100.00% |
| Macro F1 | 100.00% |
| Weighted F1 | 100.00% |
### Per-Class Performance
| Category | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| ACCOUNT | 1.0000 | 1.0000 | 1.0000 | 545 |
| CANCEL | 1.0000 | 1.0000 | 1.0000 | 95 |
| CONTACT | 1.0000 | 1.0000 | 1.0000 | 200 |
| DELIVERY | 1.0000 | 1.0000 | 1.0000 | 166 |
| FEEDBACK | 1.0000 | 1.0000 | 1.0000 | 199 |
| INVOICE | 1.0000 | 1.0000 | 1.0000 | 183 |
| ORDER | 1.0000 | 1.0000 | 1.0000 | 317 |
| PAYMENT | 1.0000 | 1.0000 | 1.0000 | 200 |
| REFUND | 1.0000 | 1.0000 | 1.0000 | 262 |
| SHIPPING | 1.0000 | 1.0000 | 1.0000 | 197 |
| SUBSCRIPTION | 1.0000 | 1.0000 | 1.0000 | 100 |
### Confusion Matrix

Perfect diagonal: zero off-diagonal errors on the held-out test set.
### Training Trajectory
| Epoch | Train Loss | Val Loss | Val Accuracy | Val Macro F1 |
|---|---|---|---|---|
| 1 | 0.0229 | 0.0163 | 99.76% | 99.78% |
| 2 | 0.0042 | 0.0106 | 99.68% | 99.68% |
| 3 | 0.0024 | 0.0054 | 99.88% | 99.88% ✦ best |
| 4 | 0.0008 | 0.0091 | 99.80% | 99.80% |
| 5 | 0.0007 | 0.0088 | 99.80% | 99.80% |
Best checkpoint selected at epoch 3 (highest validation macro F1). Early stopping was configured with patience=2.
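The checkpoint-selection logic described above can be sketched as a patience counter over validation macro F1. This is an illustrative stand-in, not the actual training script (which presumably used the `Trainer` early-stopping callback):

```python
def select_best_epoch(val_macro_f1_by_epoch, patience=2):
    """Track the best validation macro F1 and stop once it fails to
    improve for `patience` consecutive epochs."""
    best_epoch, best_f1, stale = 0, float("-inf"), 0
    for epoch, f1 in enumerate(val_macro_f1_by_epoch, start=1):
        if f1 > best_f1:
            best_epoch, best_f1, stale = epoch, f1, 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stop: no improvement for `patience` epochs
    return best_epoch, best_f1

# Validation macro F1 per epoch, from the trajectory table
scores = [0.9978, 0.9968, 0.9988, 0.9980, 0.9980]
print(select_best_epoch(scores))  # (3, 0.9988)
```

With these scores, training stops after epoch 5 (two non-improving epochs following the epoch-3 peak) and the epoch-3 checkpoint is kept.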
### Baselines
| Method | Accuracy | Macro F1 |
|---|---|---|
| Random | 9.9% | 9.1% |
| Majority class | 22.1% | 3.3% |
| TF-IDF + Logistic Regression | 99.7% | 99.7% |
| This model (DistilBERT) | 100.0% | 100.0% |
## Training Details

### Dataset

**Bitext Customer Support LLM Chatbot Training Dataset**

- License: CDLA-Sharing-1.0
- Publisher: Bitext
- Size: 26,872 rows → 24,635 after deduplication
- Splits used: 80% train (19,708) / 10% validation (2,463) / 10% test (2,464), stratified
- Language: English
- Format: synthetic, template-generated customer support messages with intentional typos, case variations, and paraphrasing. Includes `{{placeholder}}` tokens for entities such as order numbers and names.

The dataset contains 11 high-level issue categories and 27 fine-grained intents. This model classifies at the category level.
### Class Distribution (Training Set)
| Category | Count | % |
|---|---|---|
| ACCOUNT | 4,354 | 22.1% |
| ORDER | 2,534 | 12.9% |
| REFUND | 2,098 | 10.6% |
| CONTACT | 1,599 | 8.1% |
| PAYMENT | 1,598 | 8.1% |
| FEEDBACK | 1,598 | 8.1% |
| SHIPPING | 1,576 | 8.0% |
| INVOICE | 1,464 | 7.4% |
| DELIVERY | 1,327 | 6.7% |
| SUBSCRIPTION | 799 | 4.1% |
| CANCEL | 760 | 3.9% |
Imbalance ratio: 5.7× (ACCOUNT vs CANCEL). Despite this, macro F1 is perfect: all classes are well separated in semantic space.
### Preprocessing

- Exact-duplicate removal (26,872 → 24,635 samples)
- Stratified train/val/test split (80/10/10, seed=42)
- Tokenization with the `distilbert-base-uncased` tokenizer, `max_length=128`
- Dynamic padding via `DataCollatorWithPadding`

No text was truncated; the longest tokenized input is 32 tokens.
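The stratified split can be sketched in pure Python as follows. This is an illustration of the idea only; the actual pipeline presumably used `datasets` or scikit-learn utilities, and the sample data below is made up:

```python
import random
from collections import defaultdict

def stratified_split(samples, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split (text, label) pairs into train/val/test while keeping each
    label's share roughly constant across all three splits."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append(text)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label, texts in by_label.items():
        rng.shuffle(texts)
        n_train = int(len(texts) * ratios[0])
        n_val = int(len(texts) * ratios[1])
        train += [(t, label) for t in texts[:n_train]]
        val += [(t, label) for t in texts[n_train:n_train + n_val]]
        test += [(t, label) for t in texts[n_train + n_val:]]
    return train, val, test

# Toy imbalanced data: 100 ACCOUNT tickets, 20 CANCEL tickets
data = [(f"ticket {i}", "ACCOUNT") for i in range(100)] + \
       [(f"ticket {i}", "CANCEL") for i in range(20)]
train, val, test = stratified_split(data)
print(len(train), len(val), len(test))  # 96 12 12
```

Each label is split 80/10/10 independently, so the 5.7× class imbalance is preserved in every split rather than drifting between them.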
### Hyperparameters
| Parameter | Value |
|---|---|
| Base model | distilbert/distilbert-base-uncased (67M params) |
| Learning rate | 2e-5 |
| Batch size | 32 |
| Epochs | 5 (best checkpoint at epoch 3) |
| Weight decay | 0.01 |
| Warmup steps | 308 (10% of total) |
| LR scheduler | Cosine |
| Early stopping | Patience = 2 (metric: macro F1) |
| Precision | fp32 (trained on CPU) |
| Seed | 42 |
### Framework Versions
- Transformers 5.7.0
- PyTorch 2.11.0
- Datasets 4.8.5
- Tokenizers 0.22.2
## Demo

🚀 Try the live Gradio demo: paste any support ticket and see real-time classification with a confidence breakdown.
## Limitations and Risks

### Known Failure Modes

This model was stress-tested with 28 adversarial inputs. Four systematic weaknesses were identified:
#### 1. Negation Blindness 🔴

The model ignores negation. "Don't refund me, just fix the product" is classified as REFUND (99.95% confidence). The training data contains no negated intents, and DistilBERT's 6-layer architecture has limited compositional reasoning.

**Mitigation:** Add negated intent examples to the training data, or post-process with a negation detector.
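A minimal post-processing negation detector might look like the sketch below. The cue list is a hypothetical starting point, not a complete negation grammar; tickets that trip it are diverted to human review rather than trusted to the classifier:

```python
import re

# Illustrative negation cues; a production system would need a richer
# lexicon and scope handling (e.g. negation only near the intent verb).
NEGATION_CUES = re.compile(
    r"\b(don't|do not|not|never|no longer|without)\b", re.IGNORECASE
)

def needs_review(text: str) -> bool:
    """Flag tickets containing a negation cue for human review instead
    of trusting the classifier's label."""
    return bool(NEGATION_CUES.search(text))

print(needs_review("Don't refund me, just fix the product"))  # True
print(needs_review("I need a refund for my last purchase"))   # False
```

This keeps the classifier unchanged and simply routes the cases it is known to mishandle.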
#### 2. No Out-of-Distribution Rejection 🔴

The model assigns a label to any input, including gibberish, empty strings, and unrelated text. Examples:

| Input | Prediction | Confidence |
|---|---|---|
| "asdfghjkl" | ORDER | 78.1% |
| "" (empty) | ORDER | 73.7% |
| "The quick brown fox..." | CONTACT | 84.9% |

**Mitigation:** Use a confidence threshold (recommended: 0.85) to reject uncertain predictions. See the production usage example above.
#### 3. Heavy Typo Fragility 🟡

While the model handles mild typos well (the training data includes ~34% typo-augmented samples), severely misspelled text can cause misclassification:
| Input | Expected | Predicted | Confidence |
|---|---|---|---|
| "hwere is my pakage" | DELIVERY | DELIVERY | 48.1% ⚠️ |
| "I wnat to spek to a humna" | CONTACT | ORDER | 81.8% ❌ |
**Mitigation:** Add a spell-correction preprocessing step, or augment the training data with heavier typo injection.
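A spell-correction preprocessor can be sketched with the standard library's fuzzy matching. The vocabulary below is an illustrative subset; a real preprocessor would use the full training vocabulary or a dedicated spell-checking library:

```python
from difflib import get_close_matches

# Tiny in-domain vocabulary (illustrative only)
VOCAB = ["where", "is", "my", "package", "i", "want", "to", "speak", "a", "human"]

def correct(text: str, cutoff: float = 0.75) -> str:
    """Snap each token to its closest vocabulary word when the match is
    close enough; leave unrecognized tokens untouched."""
    out = []
    for tok in text.lower().split():
        match = get_close_matches(tok, VOCAB, n=1, cutoff=cutoff)
        out.append(match[0] if match else tok)
    return " ".join(out)

print(correct("hwere is my pakage"))  # where is my package
```

Running the corrected text through the classifier should recover the confident DELIVERY prediction that the raw typo-laden input only reached at 48.1%.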
#### 4. Single-Label on Multi-Intent Tickets 🟡
Real support tickets often span multiple categories. The model picks one:
| Input | Predicted | Also relevant |
|---|---|---|
| "Cancel my subscription and give me a refund" | REFUND | CANCEL, SUBSCRIPTION |
| "Your delivery is terrible, I want to complain" | DELIVERY | FEEDBACK |
**Mitigation:** For full coverage, return top-k predictions or switch to multi-label classification.
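The top-k route can be turned into a lightweight multi-label prediction by keeping every label above a secondary threshold. The threshold value and the example scores below are illustrative, not measured outputs of this model:

```python
def multi_intent(predictions, secondary_threshold=0.10):
    """Return every label whose score clears the threshold, so a ticket
    spanning several categories surfaces all of them.

    `predictions` has the shape of the pipeline's top-k output: a list of
    {'label': ..., 'score': ...} dicts sorted by descending score.
    """
    return [p["label"] for p in predictions if p["score"] >= secondary_threshold]

# Hypothetical top-3 scores for a multi-intent ticket
preds = [
    {"label": "REFUND", "score": 0.55},
    {"label": "CANCEL", "score": 0.28},
    {"label": "SUBSCRIPTION", "score": 0.12},
]
print(multi_intent(preds))  # ['REFUND', 'CANCEL', 'SUBSCRIPTION']
```

Because the model was trained single-label, its softmax scores are not calibrated for multi-label use; true multi-label fine-tuning remains the more principled fix.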
### Dataset Limitations

- **Synthetic data:** the training set is template-generated, not sourced from real customer interactions. Real-world text may contain slang, code-switching, or domain-specific jargon not represented in training.
- **English only:** the model is trained exclusively on English text.
- **Limited vocabulary:** some categories have as few as 184 unique words (CANCEL), meaning the model relies heavily on keyword matching rather than deep semantic understanding.
- **Placeholder artifacts:** training data contains `{{Order Number}}`, `{{Person Name}}`, etc. The model has learned to ignore these, but unusual entity formats in real data could affect performance.
### Bias and Fairness
- The synthetic dataset does not represent any specific demographic or dialect distribution.
- Performance on non-standard English (e.g., AAVE, Indian English, ESL patterns) has not been evaluated.
- The model may perform differently across age groups, regions, or communication styles.
### When NOT to Use This Model
- Safety-critical routing: Do not use as the sole decision-maker for urgent or safety-related tickets without human review.
- Non-English text: The model will produce unreliable predictions on non-English input.
- **Fine-grained intent classification:** this model classifies into 11 broad categories, not 27 fine-grained intents. If you need intent-level predictions (e.g., distinguishing `cancel_order` from `check_cancellation_fee`), retrain with the `intent` column.
## Citation

If you use this model, please cite the training dataset:

```bibtex
@misc{bitext2023customer,
  title={Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants},
  author={Bitext},
  year={2023},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset}
}
```
## License

This model is released under the Apache 2.0 License. The training dataset is licensed under CDLA-Sharing-1.0.