Customer Support Ticket Classifier

A fine-tuned DistilBERT model that classifies customer support tickets into 11 issue categories. Designed for automatic routing, triage, and analytics of customer inquiries.

🚀 Try the live demo →

Quick Start

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Janvi17/customer-support-ticket-classifier")

result = classifier("I want to cancel my subscription immediately")
print(result)
# [{'label': 'SUBSCRIPTION', 'score': 0.997}]
```

Categories

The model classifies text into one of 11 customer support issue types:

| Label | Description | Example |
|-------|-------------|---------|
| ACCOUNT | Account creation, deletion, password, profile changes | "How do I change my account password?" |
| CANCEL | Cancellation fees, policies, contract termination | "What is the fee for canceling the contract?" |
| CONTACT | Reaching customer service, speaking to a human agent | "I want to speak to a human agent" |
| DELIVERY | Delivery options, shipping methods, delivery regions | "Do you ship to Hungary?" |
| FEEDBACK | Reviews, complaints, submitting feedback | "I'd like to leave a review for your services" |
| INVOICE | Viewing, requesting, or locating invoices/bills | "Can you send me the invoice for order #12345?" |
| ORDER | Placing, tracking, modifying, or canceling orders | "I need help cancelling order #55123" |
| PAYMENT | Payment methods, issues, checkout errors | "I get an error when I try to check out" |
| REFUND | Refund requests, refund policy, tracking refunds | "I need a refund for my last purchase" |
| SHIPPING | Shipping address changes, setup, modifications | "I need to update my shipping address" |
| SUBSCRIPTION | Newsletter signup, unsubscribe, subscription management | "Help me unsubscribe from your newsletter" |

Usage Examples

Basic Classification

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Janvi17/customer-support-ticket-classifier")

tickets = [
    "I want to cancel my subscription immediately",
    "Where is my package? I've been waiting for 2 weeks",
    "I need a refund for my last purchase",
    "How do I change my account password?",
    "I want to speak to a human agent",
    "Can you send me the invoice for order #12345?",
]

for ticket in tickets:
    result = classifier(ticket)
    print(f"  [{result[0]['label']:>12s}] (conf: {result[0]['score']:.3f}) {ticket}")
```

Output:

```
  [SUBSCRIPTION] (conf: 0.997) I want to cancel my subscription immediately
  [    DELIVERY] (conf: 0.997) Where is my package? I've been waiting for 2 weeks
  [      REFUND] (conf: 0.999) I need a refund for my last purchase
  [     ACCOUNT] (conf: 1.000) How do I change my account password?
  [     CONTACT] (conf: 0.999) I want to speak to a human agent
  [     INVOICE] (conf: 0.997) Can you send me the invoice for order #12345?
```

Batch Classification with Confidence Scores

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="Janvi17/customer-support-ticket-classifier",
    top_k=3,  # return top 3 predictions
)

result = classifier("The payment for my subscription failed")
for pred in result[0]:
    print(f"  {pred['label']:>14s}: {pred['score']:.4f}")
#          PAYMENT: 0.9661
#     SUBSCRIPTION: 0.0153
#            ORDER: 0.0068
```

Using with PyTorch Directly

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Janvi17/customer-support-ticket-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "I need a refund for my last purchase"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)
    pred_id = probs.argmax().item()

print(f"Predicted: {model.config.id2label[pred_id]} ({probs[0][pred_id]:.4f})")
# Predicted: REFUND (0.9995)
```

Production Usage with Confidence Threshold

For production systems, reject low-confidence predictions:

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Janvi17/customer-support-ticket-classifier")
CONFIDENCE_THRESHOLD = 0.85

def classify_ticket(text: str) -> dict:
    result = classifier(text)[0]
    if result["score"] < CONFIDENCE_THRESHOLD:
        return {"label": "UNKNOWN", "score": result["score"], "routed_to": "human_review"}
    return {"label": result["label"], "score": result["score"], "routed_to": "auto"}

# High-confidence → auto-routed
print(classify_ticket("I need a refund"))
# {'label': 'REFUND', 'score': 0.999, 'routed_to': 'auto'}

# Low-confidence → human review
print(classify_ticket("asdfghjkl"))
# {'label': 'UNKNOWN', 'score': 0.78, 'routed_to': 'human_review'}
```

Evaluation Results

Held-Out Test Set (2,464 samples)

| Metric | Score |
|--------|-------|
| Accuracy | 100.00% |
| Macro F1 | 100.00% |
| Weighted F1 | 100.00% |

Per-Class Performance

| Category | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| ACCOUNT | 1.0000 | 1.0000 | 1.0000 | 545 |
| CANCEL | 1.0000 | 1.0000 | 1.0000 | 95 |
| CONTACT | 1.0000 | 1.0000 | 1.0000 | 200 |
| DELIVERY | 1.0000 | 1.0000 | 1.0000 | 166 |
| FEEDBACK | 1.0000 | 1.0000 | 1.0000 | 199 |
| INVOICE | 1.0000 | 1.0000 | 1.0000 | 183 |
| ORDER | 1.0000 | 1.0000 | 1.0000 | 317 |
| PAYMENT | 1.0000 | 1.0000 | 1.0000 | 200 |
| REFUND | 1.0000 | 1.0000 | 1.0000 | 262 |
| SHIPPING | 1.0000 | 1.0000 | 1.0000 | 197 |
| SUBSCRIPTION | 1.0000 | 1.0000 | 1.0000 | 100 |

Confusion Matrix

Perfect diagonal — zero off-diagonal errors on the held-out test set.

Training Trajectory

| Epoch | Train Loss | Val Loss | Val Accuracy | Val Macro F1 |
|-------|------------|----------|--------------|--------------|
| 1 | 0.0229 | 0.0163 | 99.76% | 99.78% |
| 2 | 0.0042 | 0.0106 | 99.68% | 99.68% |
| 3 | 0.0024 | 0.0054 | 99.88% | 99.88% ✦ best |
| 4 | 0.0008 | 0.0091 | 99.80% | 99.80% |
| 5 | 0.0007 | 0.0088 | 99.80% | 99.80% |

Best checkpoint selected at epoch 3 (highest validation macro F1). Early stopping was configured with patience=2.

Baselines

| Method | Accuracy | Macro F1 |
|--------|----------|----------|
| Random | 9.9% | 9.1% |
| Majority class | 22.1% | 3.3% |
| TF-IDF + Logistic Regression | 99.7% | 99.7% |
| This model (DistilBERT) | 100.0% | 100.0% |

Training Details

Dataset

Bitext Customer Support LLM Chatbot Training Dataset

  • License: CDLA-Sharing-1.0
  • Publisher: Bitext
  • Size: 26,872 rows → 24,635 after deduplication
  • Splits used: 80% train (19,708) / 10% validation (2,463) / 10% test (2,464), stratified
  • Language: English
  • Format: Synthetic, template-generated customer support messages with intentional typos, case variations, and paraphrasing. Includes {{placeholder}} tokens for entities like order numbers and names.

The dataset contains 11 high-level issue categories and 27 fine-grained intents. This model classifies at the category level.

Class Distribution (Training Set)

| Category | Count | % |
|----------|-------|---|
| ACCOUNT | 4,354 | 22.1% |
| ORDER | 2,534 | 12.9% |
| REFUND | 2,098 | 10.6% |
| CONTACT | 1,599 | 8.1% |
| PAYMENT | 1,598 | 8.1% |
| FEEDBACK | 1,598 | 8.1% |
| SHIPPING | 1,576 | 8.0% |
| INVOICE | 1,464 | 7.4% |
| DELIVERY | 1,327 | 6.7% |
| SUBSCRIPTION | 799 | 4.1% |
| CANCEL | 760 | 3.9% |

Imbalance ratio: 5.7× (ACCOUNT vs CANCEL). Despite this, macro F1 is perfect — all classes are well-separated in semantic space.

Preprocessing

  1. Exact-duplicate removal (26,872 → 24,635 samples)
  2. Stratified train/val/test split (80/10/10, seed=42)
  3. Tokenization with distilbert-base-uncased tokenizer, max_length=128
  4. Dynamic padding via DataCollatorWithPadding

No text was truncated — the longest tokenized input is 32 tokens.
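The deduplication and stratified-split steps can be sketched in plain Python. This is a minimal illustration of steps 1–2, not the exact training script; `examples` and the helper name are stand-ins:

```python
import random
from collections import defaultdict

def dedup_and_split(examples, seed=42, train_frac=0.8, val_frac=0.1):
    """Remove exact duplicates, then split 80/10/10 with per-class stratification.

    `examples` is a list of (text, label) pairs; returns (train, val, test).
    """
    # 1. Exact-duplicate removal (keep the first occurrence, preserve order)
    seen, unique = set(), []
    for text, label in examples:
        if text not in seen:
            seen.add(text)
            unique.append((text, label))

    # 2. Stratified split: shuffle within each class, then slice proportionally
    by_label = defaultdict(list)
    for ex in unique:
        by_label[ex[1]].append(ex)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label, group in sorted(by_label.items()):
        rng.shuffle(group)
        n_train = int(len(group) * train_frac)
        n_val = int(len(group) * val_frac)
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test
```

The actual pipeline used the distilbert-base-uncased tokenizer with dynamic padding on top of these splits, as listed above.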

Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base model | distilbert/distilbert-base-uncased (67M params) |
| Learning rate | 2e-5 |
| Batch size | 32 |
| Epochs | 5 (best checkpoint at epoch 3) |
| Weight decay | 0.01 |
| Warmup steps | 308 (10% of total) |
| LR scheduler | Cosine |
| Early stopping | Patience = 2 (metric: macro F1) |
| Precision | fp32 (trained on CPU) |
| Seed | 42 |
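Translated into a Trainer configuration, these settings would look roughly like the sketch below. This is a configuration fragment under stated assumptions, not the verbatim training script: the output path is a placeholder, and `metric_for_best_model="macro_f1"` assumes a `compute_metrics` function that returns that key.

```python
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=11
)

args = TrainingArguments(
    output_dir="ticket-classifier",     # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_steps=308,
    lr_scheduler_type="cosine",
    eval_strategy="epoch",              # `evaluation_strategy` in older releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",   # assumes compute_metrics returns this key
    seed=42,
)

trainer = Trainer(
    model=model,
    args=args,
    # train_dataset=..., eval_dataset=..., compute_metrics=... go here
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
```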

Framework Versions

  • Transformers 5.7.0
  • PyTorch 2.11.0
  • Datasets 4.8.5
  • Tokenizers 0.22.2

Demo

🚀 Try the live Gradio demo — paste any support ticket and see real-time classification with confidence breakdown.

Limitations and Risks

Known Failure Modes

This model was stress-tested with 28 adversarial inputs. Four systematic weaknesses were identified:

1. Negation Blindness 🔴

The model ignores negation. "Don't refund me, just fix the product" is classified as REFUND (99.95% confidence). The training data contains no negated intents, and DistilBERT's 6-layer architecture has limited compositional reasoning.

Mitigation: Add negated intent examples to training data, or post-process with a negation detector.
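A lightweight version of the negation-detector idea is a keyword check that flags tickets for human review when a negation cue appears near an intent-bearing word. The cue list, keyword list, and window size below are illustrative assumptions, not part of the model:

```python
import re

# Illustrative negation cues; extend for production use
NEGATION_CUES = {"not", "don't", "dont", "never", "no", "without", "won't", "wont"}

def has_negation(text: str, window: int = 3) -> bool:
    """Return True if a negation cue appears within `window` tokens
    of an intent-bearing keyword (e.g. 'refund', 'cancel')."""
    keywords = {"refund", "cancel", "unsubscribe", "delete"}
    tokens = re.findall(r"[a-z']+", text.lower())
    for i, tok in enumerate(tokens):
        if tok in NEGATION_CUES:
            nearby = tokens[max(0, i - window): i + window + 1]
            if any(k in w for w in nearby for k in keywords):
                return True
    return False

def route(text: str, prediction: dict) -> dict:
    """Send negated tickets to human review instead of trusting the label."""
    if has_negation(text):
        return {**prediction, "routed_to": "human_review", "reason": "negation"}
    return {**prediction, "routed_to": "auto"}
```

This catches the "Don't refund me" failure above at the routing layer without retraining, at the cost of some false positives on benign negations.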

2. No Out-of-Distribution Rejection 🔴

The model assigns a label to any input, including gibberish, empty strings, and unrelated text. Examples:

| Input | Prediction | Confidence |
|-------|------------|------------|
| "asdfghjkl" | ORDER | 78.1% |
| "" (empty) | ORDER | 73.7% |
| "The quick brown fox..." | CONTACT | 84.9% |

Mitigation: Use a confidence threshold (recommended: 0.85) to reject uncertain predictions. See the production usage example above.

3. Heavy Typo Fragility 🟡

While the model handles mild typos well (the training data includes ~34% typo-augmented samples), severely misspelled text can cause misclassification:

| Input | Expected | Predicted | Confidence |
|-------|----------|-----------|------------|
| "hwere is my pakage" | DELIVERY | DELIVERY | 48.1% ⚠️ |
| "I wnat to spek to a humna" | CONTACT | ORDER | 81.8% ❌ |

Mitigation: Add a spell-correction preprocessing step, or augment training data with heavier typo injection.
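A crude version of that spell-correction step can be sketched with a small domain vocabulary and fuzzy matching. The vocabulary here is an illustrative subset, and `difflib` is a stand-in for a real spell checker:

```python
import difflib
import re

# Illustrative domain vocabulary; in practice, build this from the training corpus
VOCAB = {"where", "is", "my", "package", "i", "want", "to", "speak", "a",
         "human", "agent", "refund", "cancel", "order", "subscription"}

def correct(text: str, cutoff: float = 0.75) -> str:
    """Replace out-of-vocabulary words with their closest vocabulary match."""
    out = []
    for tok in re.findall(r"\w+|\W+", text):  # keep whitespace/punctuation runs
        low = tok.lower()
        if low.isalpha() and low not in VOCAB:
            match = difflib.get_close_matches(low, VOCAB, n=1, cutoff=cutoff)
            out.append(match[0] if match else tok)
        else:
            out.append(tok)
    return "".join(out)
```

Running `correct` before the classifier turns "hwere is my pakage" into "where is my package", which the model already handles confidently.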

4. Single-Label on Multi-Intent Tickets 🟡

Real support tickets often span multiple categories. The model picks one:

| Input | Predicted | Also relevant |
|-------|-----------|---------------|
| "Cancel my subscription and give me a refund" | REFUND | CANCEL, SUBSCRIPTION |
| "Your delivery is terrible, I want to complain" | DELIVERY | FEEDBACK |

Mitigation: For full coverage, return top-k predictions or switch to multi-label classification.
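The top-k approximation is simple to bolt on: keep every prediction whose score clears a secondary threshold. The helper below operates on one pipeline result produced with `top_k` set; the 0.05 threshold is an illustrative assumption:

```python
def multi_label(predictions: list[dict], threshold: float = 0.05) -> list[str]:
    """Return all labels scoring above `threshold`, best first.

    `predictions` is one pipeline result with top_k set, e.g.
    [{'label': 'REFUND', 'score': 0.62}, {'label': 'CANCEL', 'score': 0.31}, ...]
    """
    keep = [p for p in predictions if p["score"] >= threshold]
    return [p["label"] for p in sorted(keep, key=lambda p: p["score"], reverse=True)]
```

Pair this with the `top_k` pipeline option shown in the batch-classification example; true multi-label coverage would require retraining with a sigmoid head.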

Dataset Limitations

  • Synthetic data: The training set is template-generated, not sourced from real customer interactions. Real-world text may contain slang, code-switching, or domain-specific jargon not represented in training.
  • English only: The model is trained exclusively on English text.
  • Limited vocabulary: Some categories have as few as 184 unique words (CANCEL), meaning the model relies heavily on keyword matching rather than deep semantic understanding.
  • Placeholder artifacts: Training data contains {{Order Number}}, {{Person Name}}, etc. The model has learned to ignore these, but unusual entity formats in real data could affect performance.
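If entity formats in production traffic differ from the training templates, one option is to normalize raw entities back to training-style placeholders before classification. A regex sketch; the patterns and the `{{Email}}` placeholder name are illustrative assumptions to be tuned against your own traffic:

```python
import re

# Illustrative patterns; tune to the entity formats seen in your traffic
NORMALIZERS = [
    (re.compile(r"#?\b\d{4,}\b"), "{{Order Number}}"),          # long digit runs
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "{{Email}}"),  # email addresses
]

def normalize(text: str) -> str:
    """Replace raw entity mentions with training-style placeholders."""
    for pattern, placeholder in NORMALIZERS:
        text = pattern.sub(placeholder, text)
    return text
```

Note that the digit-run pattern is deliberately crude and will also catch years and phone fragments; restrict it further if those matter for your categories.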

Bias and Fairness

  • The synthetic dataset does not represent any specific demographic or dialect distribution.
  • Performance on non-standard English (e.g., AAVE, Indian English, ESL patterns) has not been evaluated.
  • The model may perform differently across age groups, regions, or communication styles.

When NOT to Use This Model

  • Safety-critical routing: Do not use as the sole decision-maker for urgent or safety-related tickets without human review.
  • Non-English text: The model will produce unreliable predictions on non-English input.
  • Fine-grained intent classification: This model classifies into 11 broad categories, not 27 fine-grained intents. If you need intent-level predictions (e.g., distinguishing cancel_order from check_cancellation_fee), retrain with the intent column.

Citation

If you use this model, please cite the training dataset:

```bibtex
@misc{bitext2023customer,
  title={Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants},
  author={Bitext},
  year={2023},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset}
}
```

License

This model is released under the Apache 2.0 License. The training dataset is licensed under CDLA-Sharing-1.0.
