Banking77 Intent Classifier: DistilBERT + LoRA Fine-Tuning

Python PyTorch HuggingFace LoRA

Problem Statement

Customer support systems receive thousands of messages daily. Manually routing each message to the correct department is slow, expensive, and error-prone. This project builds an automated intent classifier that categorizes customer banking queries into 77 distinct intents, enabling instant routing without human intervention.

Real-world challenge: How do you fine-tune a transformer model for production use when you have no dedicated GPU, no budget for paid APIs, and limited compute resources?

Solution: parameter-efficient fine-tuning (PEFT) with LoRA, which trains only about 1.2% of the model's parameters and keeps the pretrained weights frozen.


Dataset

Banking77: a widely used benchmark for customer-support intent classification

| Property | Value |
| --- | --- |
| Training examples | 10,003 |
| Test examples | 3,080 |
| Intent classes | 77 |
| Average text length | 59.5 characters |
| Class imbalance ratio | 5.34x |

Sample intents: lost_or_stolen_card, declined_card_payment, change_pin, top_up_by_card, fiat_currency_support


Architecture & Technical Decisions

Why DistilBERT over BERT-base?

| Model | Parameters | Performance (relative to BERT-base) | Memory |
| --- | --- | --- | --- |
| BERT-base | 110M | 100% | High |
| DistilBERT | 66M | 97% | 40% less |

DistilBERT retains about 97% of BERT's performance through knowledge distillation while using 40% fewer parameters. This is critical for training on a free Colab T4 GPU without running out of memory.

Why LoRA?

Full fine-tuning of 66M parameters is expensive and risks catastrophic forgetting, where the model overwrites pretrained knowledge with task-specific patterns.

LoRA freezes all pretrained weights and adds two small adapter matrices alongside each targeted attention projection:

Original matrix W: 768 × 768 = 589,824 parameters
LoRA matrix A: 768 × 8 = 6,144 parameters
LoRA matrix B: 8 × 768 = 6,144 parameters
Reduction: 12,288 instead of 589,824 trainable parameters per matrix, about 97.9% fewer

Result:

Total parameters: 67,012,685
Trainable parameters: 797,261 (1.19%)
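The counts above can be reproduced with a few lines of arithmetic. The sketch below assumes the standard DistilBERT-base shapes (hidden size 768, 6 transformer layers) and assumes the sequence-classification head consists of two linear layers with biases that are trained in full. Neither detail is stated in this README, so treat the breakdown as a plausibility check rather than an exact account of what PEFT reports.

```python
# Back-of-the-envelope LoRA parameter count (assumed DistilBERT-base shapes).
HIDDEN = 768             # hidden size of distilbert-base-uncased
RANK = 8                 # LoRA rank r from the config
NUM_LAYERS = 6           # transformer layers in DistilBERT-base
TARGETS_PER_LAYER = 2    # q_lin and v_lin
NUM_LABELS = 77          # Banking77 intents

full_matrix = HIDDEN * HIDDEN                  # one frozen projection matrix W
adapter_pair = HIDDEN * RANK + RANK * HIDDEN   # matrices A and B together
reduction = 1 - adapter_pair / full_matrix     # fraction saved per adapted matrix

lora_params = NUM_LAYERS * TARGETS_PER_LAYER * adapter_pair

# Assumption: the head is two linear layers with biases
# (768 -> 768, then 768 -> 77), both fully trainable.
head_params = (HIDDEN * HIDDEN + HIDDEN) + (HIDDEN * NUM_LABELS + NUM_LABELS)

trainable = lora_params + head_params

print(f"per-matrix reduction: {reduction:.1%}")
print(f"LoRA adapter params:  {lora_params:,}")
print(f"head params:          {head_params:,}")
print(f"trainable total:      {trainable:,}")
```

Under these assumptions the total matches the 797,261 trainable parameters reported above. Only 147,456 of them belong to the LoRA adapters; the rest is the classification head.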

LoRA Configuration

from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=8,                                # rank: sweet spot for this task
    lora_alpha=16,                      # scaling factor (2x rank)
    target_modules=["q_lin", "v_lin"],  # query and value attention projections
    lora_dropout=0.1,                   # regularization
    task_type=TaskType.SEQ_CLS,         # sequence classification
)
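For reference, the effect of `r` and `lora_alpha` on a single adapted projection can be written out as below. This is the usual LoRA update in row-vector form, using the matrix shapes listed earlier. The symbols x (layer input) and h (layer output) are notation introduced here, not taken from the project.

```latex
h = xW + \frac{\alpha}{r}\, xAB,
\qquad
W \in \mathbb{R}^{768 \times 768},\quad
A \in \mathbb{R}^{768 \times 8},\quad
B \in \mathbb{R}^{8 \times 768}
```

With r = 8 and lora_alpha = 16, the scaling factor α/r equals 2. W stays frozen during training; only A and B receive gradient updates.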

Handling Class Imbalance

Data exploration revealed a 5.34x imbalance between the most and least frequent classes (187 vs 35 examples). Without correction, training can lead the model to neglect rare intents.

Fix: Inverse frequency weighted loss

class_weights = 1.0 / label_counts  # label_counts: per-class example counts
# Rare class -> high weight -> misclassifying it costs more
# Common class -> low weight -> frequent intents cannot dominate the loss
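A minimal, dependency-free sketch of inverse-frequency weighting follows. The function names, the toy labels, and the choice to normalize the weights to an average of 1 are assumptions made for illustration; the notebook's actual implementation may differ.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels, num_classes):
    """Return one weight per class, proportional to 1 / class count."""
    counts = Counter(labels)
    # Assumes every class id in range(num_classes) appears at least once.
    raw = [1.0 / counts[c] for c in range(num_classes)]
    # Normalize so the weights average to 1, keeping the loss scale
    # comparable to the unweighted case (an assumed design choice).
    mean = sum(raw) / num_classes
    return [w / mean for w in raw]


def weighted_cross_entropy(logits, label, weights):
    """Weighted cross-entropy for one example, using a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return weights[label] * (log_z - logits[label])


# Toy data: class 0 is common (4 examples), class 1 is rare (1 example).
toy_labels = [0, 0, 0, 0, 1]
weights = inverse_frequency_weights(toy_labels, num_classes=2)

# With identical logits, an error on the rare class costs more.
loss_common = weighted_cross_entropy([0.0, 0.0], label=0, weights=weights)
loss_rare = weighted_cross_entropy([0.0, 0.0], label=1, weights=weights)
print(weights, loss_common, loss_rare)
```

In PyTorch, weights like these would typically be passed as a tensor to `torch.nn.CrossEntropyLoss(weight=...)`, for example inside a `Trainer` subclass that overrides `compute_loss`.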

Training Configuration

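from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",      # placeholder path, not from the original notebook
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=2e-5,          # small, to keep updates gentle
    warmup_steps=100,            # gradual lr increase at start
    weight_decay=0.01,           # regularization
    fp16=True,                   # mixed precision to reduce GPU memory use
    eval_strategy="epoch",
    save_strategy="epoch",       # must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,
)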

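Why learning rate 2e-5? Large learning rates can make fine-tuning unstable. 2e-5 is a conservative choice that gently nudges the model toward the task. With LoRA the pretrained DistilBERT weights are frozen in any case, so the learning rate applies only to the adapter matrices and the classification head.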
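To make the interaction between `learning_rate` and `warmup_steps` concrete, the sketch below computes the learning rate at a given step under a linear warmup followed by linear decay, which is the default scheduler type in the HuggingFace Trainer. It assumes the full 10,003-example training split is used on a single device with no gradient accumulation; if the notebook held out a validation split, the step counts would be smaller.

```python
import math

BASE_LR = 2e-5
WARMUP_STEPS = 100
EPOCHS = 5
BATCH_SIZE = 32
TRAIN_EXAMPLES = 10_003   # assumption: the whole training split is used

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step):
    """Linear warmup from 0 to BASE_LR, then linear decay back to 0."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / max(1, total_steps - WARMUP_STEPS)


for step in (0, 50, 100, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

The peak rate of 2e-5 is reached only at step 100, and the rate then falls steadily for the rest of training.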


Results

Training Curve

| Epoch | Training Loss | Validation Loss | Accuracy |
| --- | --- | --- | --- |
| 1 | 3.9726 | 3.5859 | 38.76% |
| 2 | 2.5550 | 2.2843 | 61.14% |
| 3 | 1.9706 | 1.7091 | 68.63% |
| 4 | 1.6654 | 1.4714 | 71.03% |
| 5 | 1.5524 | 1.4026 | 71.73% |

Final Test Accuracy: 72.69%

Baseline (random): 1.3% (1 in 77). The model achieves a 56x improvement over random guessing.

Per-Class Performance

Top 5 Best Performing Intents:

| Intent | F1 Score |
| --- | --- |
| verify_top_up | 1.000 |
| age_limit | 0.976 |
| passcode_forgotten | 0.941 |
| edit_personal_details | 0.940 |
| get_physical_card | 0.925 |

Top 5 Worst Performing Intents:

| Intent | F1 Score |
| --- | --- |
| topping_up_by_card | 0.170 |
| why_verify_identity | 0.333 |
| request_refund | 0.353 |
| supported_cards_and_currencies | 0.353 |
| top_up_by_bank_transfer_charge | 0.370 |
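For readers unfamiliar with how per-class F1 is derived, the following dependency-free sketch computes it from true and predicted labels. The toy labels are made up for illustration and are not taken from the model's predictions; scikit-learn's `f1_score` with `average=None` computes the same quantity.

```python
def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1}, where F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never present and never predicted scores 0.0 here.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy example: request_refund is often mistaken for change_pin.
y_true = ["change_pin", "change_pin",
          "request_refund", "request_refund", "request_refund"]
y_pred = ["change_pin", "change_pin",
          "change_pin", "request_refund", "change_pin"]

scores = per_class_f1(y_true, y_pred, ["change_pin", "request_refund"])
for intent, f1 in scores.items():
    print(f"{intent}: {f1:.3f}")
```

A low F1 can come from missed examples (false negatives), from other intents being pulled into the class (false positives), or both, which is why confusable intents drag each other down.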

Failure Mode Analysis

Poorly performing intents share a common pattern: semantic overlap. A customer saying "I want to add money to my card" could legitimately belong to:

  • topping_up_by_card
  • top_up_by_bank_transfer_charge
  • top_up_by_cash_or_cheque
  • transfer_into_account

The model distributes confidence across all similar intents rather than making one strong prediction. This is largely a dataset limitation: Banking77's 77 classes contain genuinely ambiguous boundaries that can be difficult even for human annotators.

This explains why confidence scores are moderate:

"I want to change my PIN" → change_pin (47.46%): clear intent
"I lost my card" → card_not_working (14.40%): ambiguous
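The way confidence gets diluted across near-tied intents can be illustrated with a plain softmax. The logits below are hypothetical values chosen for the demonstration, not outputs of the trained model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate intents.
clear = softmax([4.0, 1.0, 1.0, 1.0])        # one intent clearly dominates
ambiguous = softmax([2.0, 1.9, 1.8, 1.7])    # four near-tied intents

print(f"clear case, top confidence:     {max(clear):.2%}")
print(f"ambiguous case, top confidence: {max(ambiguous):.2%}")
```

In the ambiguous case the top intent still wins, but its probability stays low because the remaining mass is shared among close competitors. With 77 classes the effect is stronger still.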


Inference Demo

test_queries = [
    "I lost my card and need a replacement",
    "Why was my payment declined?",
    "How do I add money to my account?",
    "I want to change my PIN number",
    "What currencies do you support?"
]

Results:

Text: Why was my payment declined?
Intent: declined_card_payment
Confidence: 19.94%

Text: I want to change my PIN number
Intent: change_pin
Confidence: 47.46%

Text: What currencies do you support?
Intent: fiat_currency_support
Confidence: 35.15%


What I Would Improve With More Resources

  1. Larger rank r: r=16 or r=32 would give the model more capacity to learn complex intent boundaries

  2. More data for rare classes: data augmentation or synthetic generation for intents with as few as 35 examples

  3. Intent merging: semantically overlapping intents such as topping_up_by_card and top_up_by_bank_transfer_charge could be merged into parent categories

  4. Larger base model: RoBERTa-base or DeBERTa would likely push accuracy above 80%

  5. Contrastive learning: train the model to explicitly push similar intents apart in embedding space
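Item 3 (intent merging) can be prototyped without retraining by mapping fine-grained predictions onto parent categories and re-scoring. The sketch below is illustrative only: the parent name `top_up`, the grouping, and the toy predictions are hypothetical and would need to be chosen with domain knowledge.

```python
# Hypothetical grouping of overlapping intents into one parent category.
PARENT_INTENT = {
    "topping_up_by_card": "top_up",
    "top_up_by_bank_transfer_charge": "top_up",
    "transfer_into_account": "top_up",
}


def to_parent(intent):
    """Map a fine-grained intent to its parent; unmapped intents pass through."""
    return PARENT_INTENT.get(intent, intent)


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels and predictions (made up for illustration).
y_true = ["topping_up_by_card", "transfer_into_account", "change_pin"]
y_pred = ["top_up_by_bank_transfer_charge", "transfer_into_account", "change_pin"]

fine_acc = accuracy(y_true, y_pred)
merged_acc = accuracy([to_parent(t) for t in y_true],
                      [to_parent(p) for p in y_pred])

print(f"fine-grained accuracy: {fine_acc:.2%}")
print(f"merged accuracy:       {merged_acc:.2%}")
```

Comparing the two numbers on the real test set would show how much of the error comes from confusion within a parent category rather than across categories.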


Stack

| Component | Tool |
| --- | --- |
| Base Model | distilbert-base-uncased |
| Fine-tuning | HuggingFace PEFT + LoRA |
| Training | PyTorch + HuggingFace Trainer |
| Dataset | HuggingFace Datasets |
| Evaluation | sklearn + evaluate |
| Compute | Google Colab T4 GPU (free) |

Project Structure

banking77-intent-classifier/
├── notebook.ipynb   # Full pipeline: data → train → eval → inference
└── README.md        # This file


Author

Syed Muhammad Aneeb Ur Rehman
AI/ML Engineer | Full-Stack Developer
LinkedIn | GitHub