Customer Support Ticket Classifier

A fine-tuned DistilBERT model that classifies customer support tickets into 11 issue categories. Designed for automatic routing, triage, and analytics of customer inquiries.

🚀 Try the live demo →

Quick Start

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Janvi17/customer-support-ticket-classifier")

result = classifier("I want to cancel my subscription immediately")
print(result)
# [{'label': 'SUBSCRIPTION', 'score': 0.997}]
```

Categories

The model classifies text into one of 11 customer support issue types:

| Label | Description | Example |
|-------|-------------|---------|
| ACCOUNT | Account creation, deletion, password, profile changes | "How do I change my account password?" |
| CANCEL | Cancellation fees, policies, contract termination | "What is the fee for canceling the contract?" |
| CONTACT | Reaching customer service, speaking to a human agent | "I want to speak to a human agent" |
| DELIVERY | Delivery options, shipping methods, delivery regions | "Do you ship to Hungary?" |
| FEEDBACK | Reviews, complaints, submitting feedback | "I'd like to leave a review for your services" |
| INVOICE | Viewing, requesting, or locating invoices/bills | "Can you send me the invoice for order #12345?" |
| ORDER | Placing, tracking, modifying, or canceling orders | "I need help cancelling order #55123" |
| PAYMENT | Payment methods, issues, checkout errors | "I get an error when I try to check out" |
| REFUND | Refund requests, refund policy, tracking refunds | "I need a refund for my last purchase" |
| SHIPPING | Shipping address changes, setup, modifications | "I need to update my shipping address" |
| SUBSCRIPTION | Newsletter signup, unsubscribe, subscription management | "Help me unsubscribe from your newsletter" |

Usage Examples

Basic Classification

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Janvi17/customer-support-ticket-classifier")

tickets = [
    "I want to cancel my subscription immediately",
    "Where is my package? I've been waiting for 2 weeks",
    "I need a refund for my last purchase",
    "How do I change my account password?",
    "I want to speak to a human agent",
    "Can you send me the invoice for order #12345?",
]

for ticket in tickets:
    result = classifier(ticket)
    print(f"  [{result[0]['label']:>12s}] (conf: {result[0]['score']:.3f}) {ticket}")
```

Output:

```
  [SUBSCRIPTION] (conf: 0.997) I want to cancel my subscription immediately
  [    DELIVERY] (conf: 0.997) Where is my package? I've been waiting for 2 weeks
  [      REFUND] (conf: 0.999) I need a refund for my last purchase
  [     ACCOUNT] (conf: 1.000) How do I change my account password?
  [     CONTACT] (conf: 0.999) I want to speak to a human agent
  [     INVOICE] (conf: 0.997) Can you send me the invoice for order #12345?
```

Batch Classification with Confidence Scores

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="Janvi17/customer-support-ticket-classifier",
    top_k=3,  # return top 3 predictions
)

result = classifier("The payment for my subscription failed")
for pred in result[0]:
    print(f"  {pred['label']:>14s}: {pred['score']:.4f}")
#          PAYMENT: 0.9661
#     SUBSCRIPTION: 0.0153
#            ORDER: 0.0068
```

Using with PyTorch Directly

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "Janvi17/customer-support-ticket-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "I need a refund for my last purchase"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)
    pred_id = probs.argmax().item()

print(f"Predicted: {model.config.id2label[pred_id]} ({probs[0][pred_id]:.4f})")
# Predicted: REFUND (0.9995)
```

Production Usage with Confidence Threshold

For production systems, reject low-confidence predictions:

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Janvi17/customer-support-ticket-classifier")
CONFIDENCE_THRESHOLD = 0.85

def classify_ticket(text: str) -> dict:
    result = classifier(text)[0]
    if result["score"] < CONFIDENCE_THRESHOLD:
        return {"label": "UNKNOWN", "score": result["score"], "routed_to": "human_review"}
    return {"label": result["label"], "score": result["score"], "routed_to": "auto"}

# High-confidence → auto-routed
print(classify_ticket("I need a refund"))
# {'label': 'REFUND', 'score': 0.999, 'routed_to': 'auto'}

# Low-confidence → human review
print(classify_ticket("asdfghjkl"))
# {'label': 'UNKNOWN', 'score': 0.78, 'routed_to': 'human_review'}
```

Evaluation Results

Held-Out Test Set (2,464 samples)

| Metric | Score |
|--------|-------|
| Accuracy | 100.00% |
| Macro F1 | 100.00% |
| Weighted F1 | 100.00% |

Per-Class Performance

| Category | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| ACCOUNT | 1.0000 | 1.0000 | 1.0000 | 545 |
| CANCEL | 1.0000 | 1.0000 | 1.0000 | 95 |
| CONTACT | 1.0000 | 1.0000 | 1.0000 | 200 |
| DELIVERY | 1.0000 | 1.0000 | 1.0000 | 166 |
| FEEDBACK | 1.0000 | 1.0000 | 1.0000 | 199 |
| INVOICE | 1.0000 | 1.0000 | 1.0000 | 183 |
| ORDER | 1.0000 | 1.0000 | 1.0000 | 317 |
| PAYMENT | 1.0000 | 1.0000 | 1.0000 | 200 |
| REFUND | 1.0000 | 1.0000 | 1.0000 | 262 |
| SHIPPING | 1.0000 | 1.0000 | 1.0000 | 197 |
| SUBSCRIPTION | 1.0000 | 1.0000 | 1.0000 | 100 |

Confusion Matrix

Perfect diagonal — zero off-diagonal errors on the held-out test set.

Training Trajectory

| Epoch | Train Loss | Val Loss | Val Accuracy | Val Macro F1 |
|-------|------------|----------|--------------|--------------|
| 1 | 0.0229 | 0.0163 | 99.76% | 99.78% |
| 2 | 0.0042 | 0.0106 | 99.68% | 99.68% |
| 3 | 0.0024 | 0.0054 | 99.88% | 99.88% ✦ best |
| 4 | 0.0008 | 0.0091 | 99.80% | 99.80% |
| 5 | 0.0007 | 0.0088 | 99.80% | 99.80% |

Best checkpoint selected at epoch 3 (highest validation macro F1). Early stopping was configured with patience=2.

Baselines

| Method | Accuracy | Macro F1 |
|--------|----------|----------|
| Random | 9.9% | 9.1% |
| Majority class | 22.1% | 3.3% |
| TF-IDF + Logistic Regression | 99.7% | 99.7% |
| This model (DistilBERT) | 100.0% | 100.0% |

Training Details

Dataset

Bitext Customer Support LLM Chatbot Training Dataset

  • License: CDLA-Sharing-1.0
  • Publisher: Bitext
  • Size: 26,872 rows → 24,635 after deduplication
  • Splits used: 80% train (19,708) / 10% validation (2,463) / 10% test (2,464), stratified
  • Language: English
  • Format: Synthetic, template-generated customer support messages with intentional typos, case variations, and paraphrasing. Includes {{placeholder}} tokens for entities like order numbers and names.

The dataset contains 11 high-level issue categories and 27 fine-grained intents. This model classifies at the category level.

Class Distribution (Training Set)

| Category | Count | % |
|----------|-------|---|
| ACCOUNT | 4,354 | 22.1% |
| ORDER | 2,534 | 12.9% |
| REFUND | 2,098 | 10.6% |
| CONTACT | 1,599 | 8.1% |
| PAYMENT | 1,598 | 8.1% |
| FEEDBACK | 1,598 | 8.1% |
| SHIPPING | 1,576 | 8.0% |
| INVOICE | 1,464 | 7.4% |
| DELIVERY | 1,327 | 6.7% |
| SUBSCRIPTION | 799 | 4.1% |
| CANCEL | 760 | 3.9% |

Imbalance ratio: 5.7× (ACCOUNT vs CANCEL). Despite this, macro F1 is perfect — all classes are well-separated in semantic space.

Preprocessing

  1. Exact-duplicate removal (26,872 → 24,635 samples)
  2. Stratified train/val/test split (80/10/10, seed=42)
  3. Tokenization with distilbert-base-uncased tokenizer, max_length=128
  4. Dynamic padding via DataCollatorWithPadding

No text was truncated — the longest tokenized input is 32 tokens.
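The deduplication and stratified-split steps can be sketched in plain Python. This is a minimal illustration of steps 1–2, not the exact training script; `examples` and the helper name are stand-ins:

```python
import random
from collections import defaultdict

def dedup_and_split(examples, seed=42, train_frac=0.8, val_frac=0.1):
    """Remove exact duplicates, then split 80/10/10 with per-class stratification.

    `examples` is a list of (text, label) pairs; returns (train, val, test).
    """
    # 1. Exact-duplicate removal (keep the first occurrence, preserve order)
    seen, unique = set(), []
    for text, label in examples:
        if text not in seen:
            seen.add(text)
            unique.append((text, label))

    # 2. Stratified split: shuffle within each class, then slice proportionally
    by_label = defaultdict(list)
    for ex in unique:
        by_label[ex[1]].append(ex)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label, group in sorted(by_label.items()):
        rng.shuffle(group)
        n_train = int(len(group) * train_frac)
        n_val = int(len(group) * val_frac)
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test
```

The actual pipeline used the distilbert-base-uncased tokenizer with dynamic padding on top of these splits, as listed above.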

Hyperparameters

| Parameter | Value |
|-----------|-------|
| Base model | distilbert/distilbert-base-uncased (67M params) |
| Learning rate | 2e-5 |
| Batch size | 32 |
| Epochs | 5 (best checkpoint at epoch 3) |
| Weight decay | 0.01 |
| Warmup steps | 308 (10% of total) |
| LR scheduler | Cosine |
| Early stopping | Patience = 2 (metric: macro F1) |
| Precision | fp32 (trained on CPU) |
| Seed | 42 |
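Translated into a Trainer configuration, these settings would look roughly like the sketch below. This is a configuration fragment under stated assumptions, not the verbatim training script: the output path is a placeholder, and `metric_for_best_model="macro_f1"` assumes a `compute_metrics` function that returns that key.

```python
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=11
)

args = TrainingArguments(
    output_dir="ticket-classifier",     # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    weight_decay=0.01,
    warmup_steps=308,
    lr_scheduler_type="cosine",
    eval_strategy="epoch",              # `evaluation_strategy` in older releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",   # assumes compute_metrics returns this key
    seed=42,
)

trainer = Trainer(
    model=model,
    args=args,
    # train_dataset=..., eval_dataset=..., compute_metrics=... go here
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
```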

Framework Versions

  • Transformers 5.7.0
  • PyTorch 2.11.0
  • Datasets 4.8.5
  • Tokenizers 0.22.2

Demo

🚀 Try the live Gradio demo — paste any support ticket and see real-time classification with confidence breakdown.

Limitations and Risks

Known Failure Modes

This model was stress-tested with 28 adversarial inputs. Four systematic weaknesses were identified:

1. Negation Blindness 🔴

The model ignores negation. "Don't refund me, just fix the product" is classified as REFUND (99.95% confidence). The training data contains no negated intents, and DistilBERT's 6-layer architecture has limited compositional reasoning.

Mitigation: Add negated intent examples to training data, or post-process with a negation detector.
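A lightweight version of the negation-detector idea is a keyword check that flags tickets for human review when a negation cue appears near an intent-bearing word. The cue list, keyword list, and window size below are illustrative assumptions, not part of the model:

```python
import re

# Illustrative negation cues; extend for production use
NEGATION_CUES = {"not", "don't", "dont", "never", "no", "without", "won't", "wont"}

def has_negation(text: str, window: int = 3) -> bool:
    """Return True if a negation cue appears within `window` tokens
    of an intent-bearing keyword (e.g. 'refund', 'cancel')."""
    keywords = {"refund", "cancel", "unsubscribe", "delete"}
    tokens = re.findall(r"[a-z']+", text.lower())
    for i, tok in enumerate(tokens):
        if tok in NEGATION_CUES:
            nearby = tokens[max(0, i - window): i + window + 1]
            if any(k in w for w in nearby for k in keywords):
                return True
    return False

def route(text: str, prediction: dict) -> dict:
    """Send negated tickets to human review instead of trusting the label."""
    if has_negation(text):
        return {**prediction, "routed_to": "human_review", "reason": "negation"}
    return {**prediction, "routed_to": "auto"}
```

This catches the "Don't refund me" failure above at the routing layer without retraining, at the cost of some false positives on benign negations.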

2. No Out-of-Distribution Rejection 🔴

The model assigns a label to any input, including gibberish, empty strings, and unrelated text. Examples:

| Input | Prediction | Confidence |
|-------|------------|------------|
| "asdfghjkl" | ORDER | 78.1% |
| "" (empty) | ORDER | 73.7% |
| "The quick brown fox..." | CONTACT | 84.9% |

Mitigation: Use a confidence threshold (recommended: 0.85) to reject uncertain predictions. See the production usage example above.

3. Heavy Typo Fragility 🟡

While the model handles mild typos well (the training data includes ~34% typo-augmented samples), severely misspelled text can cause misclassification:

| Input | Expected | Predicted | Confidence |
|-------|----------|-----------|------------|
| "hwere is my pakage" | DELIVERY | DELIVERY | 48.1% ⚠️ |
| "I wnat to spek to a humna" | CONTACT | ORDER | 81.8% ❌ |

Mitigation: Add a spell-correction preprocessing step, or augment training data with heavier typo injection.
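A crude version of that spell-correction step can be sketched with a small domain vocabulary and fuzzy matching. The vocabulary here is an illustrative subset, and `difflib` is a stand-in for a real spell checker:

```python
import difflib
import re

# Illustrative domain vocabulary; in practice, build this from the training corpus
VOCAB = {"where", "is", "my", "package", "i", "want", "to", "speak", "a",
         "human", "agent", "refund", "cancel", "order", "subscription"}

def correct(text: str, cutoff: float = 0.75) -> str:
    """Replace out-of-vocabulary words with their closest vocabulary match."""
    out = []
    for tok in re.findall(r"\w+|\W+", text):  # keep whitespace/punctuation runs
        low = tok.lower()
        if low.isalpha() and low not in VOCAB:
            match = difflib.get_close_matches(low, VOCAB, n=1, cutoff=cutoff)
            out.append(match[0] if match else tok)
        else:
            out.append(tok)
    return "".join(out)
```

Running `correct` before the classifier turns "hwere is my pakage" into "where is my package", which the model already handles confidently.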

4. Single-Label on Multi-Intent Tickets 🟡

Real support tickets often span multiple categories. The model picks one:

| Input | Predicted | Also relevant |
|-------|-----------|---------------|
| "Cancel my subscription and give me a refund" | REFUND | CANCEL, SUBSCRIPTION |
| "Your delivery is terrible, I want to complain" | DELIVERY | FEEDBACK |

Mitigation: For full coverage, return top-k predictions or switch to multi-label classification.
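The top-k approximation is simple to bolt on: keep every prediction whose score clears a secondary threshold. The helper below operates on one pipeline result produced with `top_k` set; the 0.05 threshold is an illustrative assumption:

```python
def multi_label(predictions: list[dict], threshold: float = 0.05) -> list[str]:
    """Return all labels scoring above `threshold`, best first.

    `predictions` is one pipeline result with top_k set, e.g.
    [{'label': 'REFUND', 'score': 0.62}, {'label': 'CANCEL', 'score': 0.31}, ...]
    """
    keep = [p for p in predictions if p["score"] >= threshold]
    return [p["label"] for p in sorted(keep, key=lambda p: p["score"], reverse=True)]
```

Pair this with the `top_k` pipeline option shown in the batch-classification example; true multi-label coverage would require retraining with a sigmoid head.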

Dataset Limitations

  • Synthetic data: The training set is template-generated, not sourced from real customer interactions. Real-world text may contain slang, code-switching, or domain-specific jargon not represented in training.
  • English only: The model is trained exclusively on English text.
  • Limited vocabulary: Some categories have as few as 184 unique words (CANCEL), meaning the model relies heavily on keyword matching rather than deep semantic understanding.
  • Placeholder artifacts: Training data contains {{Order Number}}, {{Person Name}}, etc. The model has learned to ignore these, but unusual entity formats in real data could affect performance.
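If entity formats in production traffic differ from the training templates, one option is to normalize raw entities back to training-style placeholders before classification. A regex sketch; the patterns and the `{{Email}}` placeholder name are illustrative assumptions to be tuned against your own traffic:

```python
import re

# Illustrative patterns; tune to the entity formats seen in your traffic
NORMALIZERS = [
    (re.compile(r"#?\b\d{4,}\b"), "{{Order Number}}"),          # long digit runs
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "{{Email}}"),  # email addresses
]

def normalize(text: str) -> str:
    """Replace raw entity mentions with training-style placeholders."""
    for pattern, placeholder in NORMALIZERS:
        text = pattern.sub(placeholder, text)
    return text
```

Note that the digit-run pattern is deliberately crude and will also catch years and phone fragments; restrict it further if those matter for your categories.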

Bias and Fairness

  • The synthetic dataset does not represent any specific demographic or dialect distribution.
  • Performance on non-standard English (e.g., AAVE, Indian English, ESL patterns) has not been evaluated.
  • The model may perform differently across age groups, regions, or communication styles.

When NOT to Use This Model

  • Safety-critical routing: Do not use as the sole decision-maker for urgent or safety-related tickets without human review.
  • Non-English text: The model will produce unreliable predictions on non-English input.
  • Fine-grained intent classification: This model classifies into 11 broad categories, not 27 fine-grained intents. If you need intent-level predictions (e.g., distinguishing cancel_order from check_cancellation_fee), retrain with the intent column.

Citation

If you use this model, please cite the training dataset:

```bibtex
@misc{bitext2023customer,
  title={Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants},
  author={Bitext},
  year={2023},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset}
}
```

License

This model is released under the Apache 2.0 License. The training dataset is licensed under CDLA-Sharing-1.0.
