# Banking77 Intent Classifier — DistilBERT + LoRA Fine-Tuning

![Python](https://img.shields.io/badge/Python-3.10-blue) ![PyTorch](https://img.shields.io/badge/PyTorch-2.0-orange) ![HuggingFace](https://img.shields.io/badge/HuggingFace-Transformers-yellow) ![LoRA](https://img.shields.io/badge/PEFT-LoRA-green)

## Problem Statement

Customer support systems receive thousands of messages daily. Manually routing each message to the correct department is slow, expensive, and error-prone. This project builds an automated intent classifier that sorts customer banking queries into 77 distinct intents, enabling instant, accurate routing without human intervention.

**Real-world challenge:** How do you fine-tune a transformer model for production use when you have no local GPU, no budget for paid APIs, and limited compute?

**Solution:** Parameter-Efficient Fine-Tuning (PEFT) with LoRA, training only 1.17% of the model's parameters while keeping the pretrained weights frozen.

---

## Dataset

**Banking77**: an industry-standard customer-support intent benchmark.

| Property | Value |
|---|---|
| Training examples | 10,003 |
| Test examples | 3,080 |
| Intent classes | 77 |
| Average text length | 59.5 characters |
| Class imbalance ratio (largest / smallest class) | 5.34x |

Sample intents: `lost_or_stolen_card`, `declined_card_payment`, `change_pin`, `top_up_by_card`, `fiat_currency_support`

---

## Architecture & Technical Decisions

### Why DistilBERT over BERT-base?

| Model | Parameters | Relative performance | Memory |
|---|---|---|---|
| BERT-base | 110M | 100% | High |
| DistilBERT | 66M | 97% | 40% less |

DistilBERT retains about 97% of BERT's performance through knowledge distillation while using 40% fewer parameters. That smaller footprint matters when training on a free Colab T4 GPU, where running out of memory is a real risk.

### Why LoRA?

Full fine-tuning of all 66M parameters is expensive and risks catastrophic forgetting, where the model overwrites pretrained knowledge with task-specific patterns.
LoRA freezes all pretrained weights and adds two small low-rank adapter matrices alongside selected attention projections (here the query and value matrices). For a single 768 × 768 projection with rank r = 8:

- Original matrix W: 768 × 768 = 589,824 parameters
- LoRA matrix A: 768 × 8 = 6,144 parameters
- LoRA matrix B: 8 × 768 = 6,144 parameters
- Reduction: A and B together hold 12,288 parameters, 97.9% fewer trainable parameters than W

**Result:**

- Total parameters: 67,012,685
- Trainable parameters: 797,261 (1.17%)

### LoRA Configuration

```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=8,                                # rank: sweet spot for this task
    lora_alpha=16,                      # scaling factor (2x rank)
    target_modules=["q_lin", "v_lin"],  # query and value attention matrices
    lora_dropout=0.1,                   # regularization
    task_type=TaskType.SEQ_CLS,         # sequence classification
)
```

### Handling Class Imbalance

Data exploration revealed a 5.34x imbalance between the most and least frequent classes (187 vs 35 examples). Without correction, the model can learn to largely ignore rare intents.

**Fix: inverse-frequency weighted loss**

```python
import torch

label_counts = torch.bincount(torch.tensor(train_labels), minlength=77)  # train_labels: hypothetical list of int labels
class_weights = 1.0 / label_counts.float()
# Rare class → high weight → misclassifying it costs more
# Common class → low weight → the model cannot ignore rare classes
```

---

## Training Configuration

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="banking77-lora",     # hypothetical output path
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=2e-5,              # small to prevent catastrophic forgetting
    warmup_steps=100,                # gradual LR increase at start
    weight_decay=0.01,               # regularization
    fp16=True,                       # mixed precision: less memory, faster on a T4
    eval_strategy="epoch",           # older transformers releases call this evaluation_strategy
    save_strategy="epoch",           # must match eval_strategy for load_best_model_at_end
    load_best_model_at_end=True,
)
```

**Why learning rate 2e-5?** Large learning rates can push the model far from its pretrained behavior. 2e-5 gently nudges existing knowledge toward the task without destroying what DistilBERT learned during pretraining.
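As a sanity check on the 797,261 trainable parameters reported under "Why LoRA?", the count can be rebuilt by hand. This is a sketch under stated assumptions, not the notebook's code: it assumes DistilBERT's 6 transformer blocks with hidden size 768, LoRA on `q_lin` and `v_lin` only, and that PEFT's `SEQ_CLS` task also trains the `pre_classifier` and `classifier` layers alongside the adapters.

```python
# Hand reconstruction of the trainable-parameter count for DistilBERT + LoRA.
# Assumptions (not stated in this README): DistilBERT has 6 transformer
# blocks with hidden size 768, and PEFT's SEQ_CLS task also trains the
# pre_classifier and classifier layers alongside the adapters.

HIDDEN = 768           # DistilBERT hidden size
N_LAYERS = 6           # DistilBERT transformer blocks
N_LABELS = 77          # Banking77 intents
RANK = 8               # LoRA rank r
TARGETS_PER_LAYER = 2  # q_lin and v_lin


def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameters in a dense layer: weight matrix plus optional bias."""
    return n_in * n_out + (n_out if bias else 0)


def lora_adapter_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA pair for one projection: A maps d_in -> r, B maps r -> d_out, no bias."""
    return d_in * r + r * d_out


full_matrix = HIDDEN * HIDDEN
one_adapter = lora_adapter_params(HIDDEN, HIDDEN, RANK)
reduction = 1 - one_adapter / full_matrix

lora_total = N_LAYERS * TARGETS_PER_LAYER * one_adapter
head_total = linear_params(HIDDEN, HIDDEN) + linear_params(HIDDEN, N_LABELS)
trainable = lora_total + head_total

print(f"W: {full_matrix:,}  A+B: {one_adapter:,}  reduction: {reduction:.1%}")
print(f"LoRA adapters:       {lora_total:,}")
print(f"classification head: {head_total:,}")
print(f"trainable total:     {trainable:,}")
```

Run as-is, the last line prints `797,261`, matching the reported figure. If this reconstruction is right, the adapters themselves account for 147,456 parameters and the classification head for the remaining 649,805.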
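The class weights from "Handling Class Imbalance" only change training once the loss consumes them. In PyTorch that is typically `torch.nn.CrossEntropyLoss(weight=class_weights)`; this README does not show how the weights reach the Trainer (overriding `Trainer.compute_loss` is one common route, assumed here rather than confirmed). The dependency-free sketch below is an illustration, not the project's code: it mirrors the weighted-mean formula PyTorch documents for that loss and uses the 187 vs 35 class counts from the imbalance analysis. The helper names are hypothetical.

```python
import math


def inverse_frequency_weights(label_counts: list[int]) -> list[float]:
    """One weight per class, 1 / count (same idea as class_weights = 1.0 / label_counts)."""
    return [1.0 / c for c in label_counts]


def weighted_cross_entropy(logits: list[list[float]], labels: list[int],
                           weights: list[float]) -> float:
    """Weighted-mean cross-entropy: sum_i w[y_i] * nll_i / sum_i w[y_i]."""
    total, norm = 0.0, 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]  # negative log-probability of the true class
        total += weights[y] * nll
        norm += weights[y]
    return total / norm


# Two classes with the extreme counts from the imbalance analysis:
# class 0 is common (187 examples), class 1 is rare (35 examples).
weights = inverse_frequency_weights([187, 35])
labels = [0, 1]
common_wrong = [[0.0, 2.0], [0.0, 2.0]]  # both rows predict class 1: the common example is wrong
rare_wrong = [[2.0, 0.0], [2.0, 0.0]]    # both rows predict class 0: the rare example is wrong

print("weighted, common example wrong:", round(weighted_cross_entropy(common_wrong, labels, weights), 4))
print("weighted, rare example wrong:  ", round(weighted_cross_entropy(rare_wrong, labels, weights), 4))
```

With uniform weights `[1.0, 1.0]` both batches produce the same loss. With inverse-frequency weights, the batch that misclassifies the rare example scores higher, which is the extra pressure the weighting is meant to add.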
---

## Results

### Training Curve

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 3.9726 | 3.5859 | 38.76% |
| 2 | 2.5550 | 2.2843 | 61.14% |
| 3 | 1.9706 | 1.7091 | 68.63% |
| 4 | 1.6654 | 1.4714 | 71.03% |
| 5 | 1.5524 | 1.4026 | 71.73% |

**Final Test Accuracy: 72.69%**

Random baseline: 1.3% (1 in 77). The model achieves roughly a 56x improvement over random guessing.

### Per-Class Performance

**Top 5 Best Performing Intents:**

| Intent | F1 Score |
|---|---|
| verify_top_up | 1.000 |
| age_limit | 0.976 |
| passcode_forgotten | 0.941 |
| edit_personal_details | 0.940 |
| get_physical_card | 0.925 |

**Top 5 Worst Performing Intents:**

| Intent | F1 Score |
|---|---|
| topping_up_by_card | 0.170 |
| why_verify_identity | 0.333 |
| request_refund | 0.353 |
| supported_cards_and_currencies | 0.353 |
| top_up_by_bank_transfer_charge | 0.370 |

### Failure Mode Analysis

The poorly performing intents share a common pattern: semantic overlap. A customer saying "I want to add money to my card" could legitimately belong to:

- `topping_up_by_card`
- `top_up_by_bank_transfer_charge`
- `top_up_by_cash`
- `transfer_into_account`

The model spreads its confidence across all of these similar intents rather than making one strong prediction. This is largely a dataset limitation: several of Banking77's 77 classes have genuinely ambiguous boundaries that even human annotators can struggle with.

**This explains why confidence scores are moderate:**

- "I want to change my PIN" → `change_pin` (47.46%): clear intent
- "I lost my card" → `card_not_working` (14.40%): ambiguous

---

## Inference Demo

```python
test_queries = [
    "I lost my card and need a replacement",
    "Why was my payment declined?",
    "How do I add money to my account?",
    "I want to change my PIN number",
    "What currencies do you support?",
]
```

**Results:**

- Text: Why was my payment declined?
  - Intent: `declined_card_payment`
  - Confidence: 19.94%
- Text: I want to change my PIN number
  - Intent: `change_pin`
  - Confidence: 47.46%
- Text: What currencies do you support?
  - Intent: `fiat_currency_support`
  - Confidence: 35.15%

---

## What I Would Improve With More Resources

1. **Larger rank r**: r=16 or r=32 would give the adapters more capacity to learn complex intent boundaries.
2. **More data for rare classes**: data augmentation or synthetic generation for intents with only 35 examples.
3. **Intent merging**: semantically overlapping intents such as `topping_up_by_card` and `top_up_by_bank_transfer_charge` could be merged into parent categories.
4. **Larger base model**: RoBERTa-base or DeBERTa would likely push accuracy above 80%.
5. **Contrastive learning**: train the model to explicitly push similar intents apart in embedding space.

---

## Stack

| Component | Tool |
|---|---|
| Base Model | distilbert-base-uncased |
| Fine-tuning | HuggingFace PEFT + LoRA |
| Training | PyTorch + HuggingFace Trainer |
| Dataset | HuggingFace Datasets |
| Evaluation | sklearn + evaluate |
| Compute | Google Colab T4 GPU (free) |

---

## Project Structure

`banking77-intent-classifier/`

- `notebook.ipynb`: full pipeline (data → train → eval → inference)
- `README.md`: this file

---

## Author

**Syed Muhammad Aneeb Ur Rehman**

AI/ML Engineer | Full-Stack Developer

[LinkedIn](https://linkedin.com/in/syedaneeb15) | [GitHub](https://github.com/aneebnaqvi15)