| # Banking77 Intent Classifier: DistilBERT + LoRA Fine-Tuning |
|
|
|
|
| ## Problem Statement |
|
|
| Customer support systems receive thousands of messages daily. |
| Manually routing each message to the correct department is |
| slow, expensive, and error-prone. This project builds an |
| automated intent classifier that categorizes customer banking |
| queries into 77 distinct intents, enabling instant, accurate |
| routing without human intervention. |
|
|
| **Real-world challenge:** How do you fine-tune a transformer |
| model for production use when you have no dedicated GPU, no |
| budget for paid APIs, and limited compute resources? |
|
|
| **Solution:** Parameter-efficient fine-tuning (PEFT) with LoRA, |
| which trains only 1.17% of model parameters while the pretrained |
| DistilBERT weights stay frozen. |
|
|
| --- |
|
|
| ## Dataset |
|
|
| **Banking77**: a widely used customer support intent benchmark |
|
|
| | Property | Value | |
| |---|---| |
| | Training examples | 10,003 | |
| | Test examples | 3,080 | |
| | Intent classes | 77 | |
| | Average text length | 59.5 characters | |
| | Class imbalance ratio | 5.34x | |
|
|
| Sample intents: `lost_or_stolen_card`, `declined_card_payment`, |
| `change_pin`, `topping_up_by_card`, `fiat_currency_support` |
|
|
| --- |
|
|
| ## Architecture & Technical Decisions |
|
|
| ### Why DistilBERT over BERT-base? |
|
|
| | Model | Parameters | Relative performance | Memory footprint | |
| |---|---|---|---| |
| | BERT-base | 110M | 100% (reference) | Baseline | |
| | DistilBERT | 66M | ~97% | About 40% less | |
|
|
| DistilBERT retains about 97% of BERT's language-understanding |
| performance through knowledge distillation while using 40% fewer |
| parameters. This matters when training on a free Colab T4 GPU, |
| where memory is limited. |
|
|
| ### Why LoRA? |
|
|
| Full fine-tuning of 66M parameters is expensive and risks |
| catastrophic forgetting, where the model overwrites pretrained |
| knowledge with task-specific patterns. |
|
|
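| LoRA freezes all pretrained weights and adds two small trainable |
| matrices, A and B, alongside each targeted attention projection |
| (here, the query and value projections). For a single 768 × 768 |
| projection with rank r = 8: |
|
| - Original matrix W: 768 × 768 = 589,824 parameters (frozen) |
| - LoRA matrix A: 768 × 8 = 6,144 parameters |
| - LoRA matrix B: 8 × 768 = 6,144 parameters |
| - A and B together: 12,288 parameters, about 97.9% fewer than W |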
|
|
| **Result:** |
|
| - Total parameters: 67,012,685 |
| - Trainable parameters: 797,261 (1.17%) |
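The trainable count above can be reproduced with simple arithmetic. The sketch below is one decomposition that arrives at the same figure. It assumes DistilBERT-base's 6 transformer layers, LoRA adapters on `q_lin` and `v_lin` in every layer, and a trainable classification head made of a 768-to-768 layer followed by a 768-to-77 layer. None of these details are stated in this README, so treat the constant names and the breakdown as assumptions.

```python
# Sketch: reproduce the trainable-parameter count from first principles.
# Assumptions (not stated in the README): 6 layers, hidden size 768,
# LoRA on q_lin and v_lin, and a trainable two-layer classification head.
HIDDEN = 768
RANK = 8
NUM_LAYERS = 6          # DistilBERT-base depth
TARGETS_PER_LAYER = 2   # q_lin and v_lin
NUM_CLASSES = 77

# Each adapted projection gets two matrices of HIDDEN * RANK entries (A and B).
lora_per_matrix = 2 * HIDDEN * RANK
lora_total = lora_per_matrix * TARGETS_PER_LAYER * NUM_LAYERS

# Head: Linear(768, 768) followed by Linear(768, 77), weights plus biases.
head_total = (HIDDEN * HIDDEN + HIDDEN) + (HIDDEN * NUM_CLASSES + NUM_CLASSES)

trainable = lora_total + head_total
print(lora_per_matrix, lora_total, head_total, trainable)
# prints: 12288 147456 649805 797261
```

Under these assumptions, most of the trainable parameters sit in the classification head rather than in the LoRA adapters themselves.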
|
|
| ### LoRA Configuration |
|
|
| ```python |
| from peft import LoraConfig, TaskType |
|
| lora_config = LoraConfig( |
|     r=8,                                # rank (chosen as a sweet spot for this task) |
|     lora_alpha=16,                      # scaling factor (2x rank) |
|     target_modules=["q_lin", "v_lin"],  # query and value projections in DistilBERT |
|     lora_dropout=0.1,                   # dropout on the adapter path (regularization) |
|     task_type=TaskType.SEQ_CLS,         # sequence classification |
| ) |
| ``` |
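To make the roles of `r` and `lora_alpha` concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear map on toy matrices. It is not the PEFT implementation. It only illustrates the standard formulation, where the frozen output `W x` is combined with a low-rank update scaled by `lora_alpha / r`. The helper names (`matvec`, `lora_forward`) and the toy values are made up for illustration.

```python
# Sketch of a LoRA-adapted linear map on tiny matrices (pure Python).
# W is frozen; only A and B would be trained. All names are illustrative.

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, r, lora_alpha):
    """Return W x + (lora_alpha / r) * B (A x)."""
    scale = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes: hidden=3, r=1 (the README uses hidden=768, r=8, lora_alpha=16).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.5, 0.5, 0.0]]              # r x hidden
B_zero = [[0.0], [0.0], [0.0]]     # hidden x r, all zeros
B_trained = [[1.0], [0.0], [0.0]]  # hidden x r, after hypothetical training
x = [2.0, 4.0, 6.0]

print(lora_forward(W, A, B_zero, x, r=1, lora_alpha=2))     # [2.0, 4.0, 6.0]
print(lora_forward(W, A, B_trained, x, r=1, lora_alpha=2))  # [8.0, 4.0, 6.0]
```

With B set to zero, which is a common initialization, the adapted layer reproduces the frozen layer exactly, so training starts from the pretrained behavior. With `r=8` and `lora_alpha=16` the scale factor is 2.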
|
|
| ### Handling Class Imbalance |
|
|
| Data exploration revealed a 5.34x imbalance between most and |
| least frequent classes (187 vs. 35 examples). Without correction, |
| training can bias the model toward frequent intents at the |
| expense of rare ones. |
|
|
| **Fix: Inverse frequency weighted loss** |
|
|
| ```python |
| # label_counts: per-class example counts (e.g. a NumPy array or torch tensor) |
| class_weights = 1.0 / label_counts |
| # Rare class   → high weight → misclassifying it costs more |
| # Common class → low weight  → frequent classes cannot dominate the loss |
| ``` |
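As a self-contained illustration of the weighting idea, the sketch below computes inverse-frequency weights from a list of labels using only the standard library. The function name `inverse_frequency_weights` is hypothetical. Rescaling the weights so they average to 1.0 is an added assumption and not something the original snippet does.

```python
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """Return one weight per class, proportional to 1 / count.

    Weights are rescaled to average 1.0 (an assumption of this sketch),
    which keeps the loss magnitude comparable to the unweighted case.
    Classes with no examples get weight 0.0.
    """
    counts = Counter(labels)
    raw = [1.0 / counts[c] if counts[c] else 0.0 for c in range(num_classes)]
    mean = sum(raw) / num_classes
    return [w / mean for w in raw]

# Toy labels mirroring the reported extremes: 187 vs. 35 examples.
toy_labels = [0] * 187 + [1] * 35
weights = inverse_frequency_weights(toy_labels, num_classes=2)
print([round(w, 3) for w in weights])
```

The rare class ends up with a weight about 5.34 times larger than the common class, matching the imbalance ratio. In a PyTorch training loop, a weight vector like this would typically be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.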
|
|
| --- |
|
|
| ## Training Configuration |
|
|
| ```python |
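| from transformers import TrainingArguments |
|
| training_args = TrainingArguments( |
|     output_dir="./results",          # hypothetical path; not given in the original |
|     num_train_epochs=5, |
|     per_device_train_batch_size=32, |
|     learning_rate=2e-5,              # small, conservative step size |
|     warmup_steps=100,                # gradual lr increase at start |
|     weight_decay=0.01,               # regularization |
|     fp16=True,                       # mixed precision; lowers memory use on the T4 |
|     eval_strategy="epoch", |
|     save_strategy="epoch",           # assumed; must match eval_strategy for load_best_model_at_end |
|     load_best_model_at_end=True, |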
| ) |
| ``` |
|
|
| **Why learning rate 2e-5?** |
| Large learning rates can destabilize fine-tuning. 2e-5 is a |
| conservative value that nudges the model toward the task in small |
| steps. With LoRA the pretrained DistilBERT weights are frozen, so |
| this rate applies to the adapter matrices and classification head. |
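To see how `warmup_steps=100` and `learning_rate=2e-5` interact over a run, here is a small sketch of a linear warmup followed by linear decay. It assumes the Trainer's default linear scheduler, a single device, and that all 10,003 training examples are used each epoch. The function name `linear_schedule_lr` is illustrative and not a library call.

```python
import math

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at an optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Assumption: all 10,003 training examples are seen each epoch on one device.
steps_per_epoch = math.ceil(10_003 / 32)  # 313 steps
total_steps = steps_per_epoch * 5         # 1565 steps over 5 epochs

for step in (0, 50, 100, total_steps):
    print(step, linear_schedule_lr(step, 2e-5, 100, total_steps))
```

Under these assumptions, warmup covers roughly the first third of epoch 1, the rate peaks at 2e-5 at step 100, and it falls to zero by the final step.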
|
|
| --- |
|
|
| ## Results |
|
|
| ### Training Curve |
|
|
| | Epoch | Training Loss | Validation Loss | Accuracy | |
| |---|---|---|---| |
| | 1 | 3.9726 | 3.5859 | 38.76% | |
| | 2 | 2.5550 | 2.2843 | 61.14% | |
| | 3 | 1.9706 | 1.7091 | 68.63% | |
| | 4 | 1.6654 | 1.4714 | 71.03% | |
| | 5 | 1.5524 | 1.4026 | 71.73% | |
|
|
| **Final Test Accuracy: 72.69%** |
|
|
| Baseline (random guessing over 77 classes): 1.3%. The model achieves a roughly 56x improvement over random. |
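The baseline comparison is simple arithmetic, sketched here for a uniform random guess over 77 classes. A majority-class baseline would give a slightly different reference point and is not computed here.

```python
NUM_CLASSES = 77
test_accuracy = 0.7269  # reported final test accuracy

# Uniform random guessing picks the right class 1 time in 77.
random_baseline = 1 / NUM_CLASSES
improvement = test_accuracy / random_baseline
print(f"random baseline: {random_baseline:.1%}, improvement: {improvement:.0f}x")
# prints: random baseline: 1.3%, improvement: 56x
```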
|
|
| ### Per-Class Performance |
|
|
| **Top 5 Best Performing Intents:** |
|
|
| | Intent | F1 Score | |
| |---|---| |
| | verify_top_up | 1.000 | |
| | age_limit | 0.976 | |
| | passcode_forgotten | 0.941 | |
| | edit_personal_details | 0.940 | |
| | get_physical_card | 0.925 | |
|
|
| **5 Worst Performing Intents:** |
|
|
| | Intent | F1 Score | |
| |---|---| |
| | topping_up_by_card | 0.170 | |
| | why_verify_identity | 0.333 | |
| | request_refund | 0.353 | |
| | supported_cards_and_currencies | 0.353 | |
| | top_up_by_bank_transfer_charge | 0.370 | |
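For reference, the per-class F1 score combines precision and recall for a single intent. The project lists sklearn for evaluation. The standard-library sketch below only illustrates the calculation on made-up predictions, and the helper name `per_class_f1` is hypothetical.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class from parallel lists of true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example with made-up predictions (not the project's outputs).
y_true = ["change_pin", "change_pin", "request_refund", "request_refund"]
y_pred = ["change_pin", "change_pin", "change_pin", "request_refund"]
print(per_class_f1(y_true, y_pred, "change_pin"))
print(per_class_f1(y_true, y_pred, "request_refund"))
```

In the toy data, one `request_refund` query is absorbed by `change_pin`. That lowers recall for the first class and precision for the second, which is the same mechanism that drags down F1 for overlapping intents.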
|
|
| ### Failure Mode Analysis |
|
|
| Poorly performing intents share a common pattern: semantic |
| overlap. A customer saying "I want to add money to my card" |
| could legitimately belong to: |
|
|
| - `topping_up_by_card` |
| - `top_up_by_bank_transfer_charge` |
| - `top_up_by_cash_or_cheque` |
| - `transfer_into_account` |
|
|
| The model spreads its confidence across these similar intents |
| rather than making one strong prediction. This is partly a |
| dataset limitation: several of Banking77's 77 classes have |
| ambiguous boundaries that can be hard even for human annotators. |
|
|
| **This explains why confidence scores are moderate:** |
|
| - "I want to change my PIN" → `change_pin` (47.46%): clear intent |
| - "I lost my card" → `card_not_working` (14.40%): ambiguous |
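Confidence scores like these are typically the largest softmax probability over the model's output logits. The sketch below uses made-up logits and only four classes, purely to show how near-tied scores dilute the top probability. It does not use the project's actual outputs.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1 (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for illustration only.
clear = softmax([4.0, 1.0, 1.0, 1.0])      # one intent stands out
ambiguous = softmax([2.0, 1.9, 1.8, 1.7])  # four near-ties
print(round(max(clear), 3), round(max(ambiguous), 3))
```

With 77 classes instead of four, even a modest amount of probability mass on each competing intent adds up, which is consistent with top confidences well below 50%.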
| |
| --- |
| |
| ## Inference Demo |
| |
| ```python |
| test_queries = [ |
| "I lost my card and need a replacement", |
| "Why was my payment declined?", |
| "How do I add money to my account?", |
| "I want to change my PIN number", |
| "What currencies do you support?" |
| ] |
| ``` |
| |
| **Results:** |
|
| | Text | Intent | Confidence | |
| |---|---|---| |
| | Why was my payment declined? | declined_card_payment | 19.94% | |
| | I want to change my PIN number | change_pin | 47.46% | |
| | What currencies do you support? | fiat_currency_support | 35.15% | |
| |
| --- |
| |
| ## What I Would Improve With More Resources |
| |
| 1. **Larger rank r**: r=16 or r=32 would give the model more |
| capacity to learn complex intent boundaries |
|
| 2. **More data for rare classes**: data augmentation or |
| synthetic generation for intents with only 35 examples |
|
| 3. **Intent merging**: semantically overlapping intents |
| like `topping_up_by_card` and `top_up_by_bank_transfer_charge` |
| could be merged into parent categories |
|
| 4. **Larger base model**: RoBERTa-base or DeBERTa would |
| likely push accuracy above 80% |
|
| 5. **Contrastive learning**: train the model to explicitly |
| push similar intents apart in embedding space |
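As a rough illustration of the intent-merging idea in item 3, predictions could be mapped to coarser parent categories after classification. The parent name `top_up` and the mapping below are hypothetical and not part of Banking77. A real grouping would need review against routing requirements.

```python
# Hypothetical grouping: the parent name below is illustrative, not part of Banking77.
PARENT_OF = {
    "topping_up_by_card": "top_up",
    "top_up_by_bank_transfer_charge": "top_up",
}

def to_parent(intent):
    """Map a fine-grained intent to its parent; unmapped intents pass through."""
    return PARENT_OF.get(intent, intent)

predictions = ["topping_up_by_card", "top_up_by_bank_transfer_charge", "change_pin"]
print([to_parent(p) for p in predictions])
# prints: ['top_up', 'top_up', 'change_pin']
```

Merging trades routing granularity for accuracy: confusions inside a parent category no longer count as errors, but the downstream team receives a less specific label.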
| |
| --- |
| |
| ## Stack |
| |
| | Component | Tool | |
| |---|---| |
| | Base Model | distilbert-base-uncased | |
| | Fine-tuning | HuggingFace PEFT + LoRA | |
| | Training | PyTorch + HuggingFace Trainer | |
| | Dataset | HuggingFace Datasets | |
| | Evaluation | sklearn + evaluate | |
| | Compute | Google Colab T4 GPU (free) | |
| |
| --- |
| |
| ## Project Structure |
|
| `banking77-intent-classifier/` |
|
| - `notebook.ipynb`: full pipeline (data → train → eval → inference) |
| - `README.md`: this file |
| |
| --- |
| |
| ## Author |
| |
| **Syed Muhammad Aneeb Ur Rehman** |
| AI/ML Engineer | Full-Stack Developer |
| [LinkedIn](https://linkedin.com/in/syedaneeb15) | |
| [GitHub](https://github.com/aneebnaqvi15) |
| |