# Banking77 Intent Classifier: DistilBERT + LoRA Fine-Tuning




## Problem Statement
Customer support systems receive thousands of messages daily.
Manually routing each message to the correct department is
slow, expensive, and error-prone. This project builds an
automated intent classifier that categorizes customer banking
queries into 77 distinct intents, enabling instant, accurate
routing without human intervention.
**Real-world challenge:** How do you fine-tune a transformer
model for production use with no dedicated GPU, no budget for
paid APIs, and limited compute resources?

**Solution:** Parameter-efficient fine-tuning (PEFT) with LoRA,
training only 1.17% of the model's parameters while keeping the
pretrained weights frozen.

---
## Dataset
**Banking77**: an industry-standard customer support benchmark
| Property | Value |
|---|---|
| Training examples | 10,003 |
| Test examples | 3,080 |
| Intent classes | 77 |
| Average text length | 59.5 characters |
| Class imbalance ratio | 5.34x |
Sample intents: `lost_or_stolen_card`, `declined_card_payment`,
`change_pin`, `topping_up_by_card`, `fiat_currency_support`
---
## Architecture & Technical Decisions
### Why DistilBERT over BERT-base?
| Model | Parameters | Relative performance | Memory |
|---|---|---|---|
| BERT-base | 110M | 100% (reference) | High |
| DistilBERT | 66M | ~97% | ~40% less |

DistilBERT retains about 97% of BERT's performance through knowledge
distillation while using 40% fewer parameters. This is critical for
training on a free Colab T4 GPU without running out of memory.
### Why LoRA?
Full fine-tuning of 66M parameters is expensive and risks
catastrophic forgetting, where the model overwrites pretrained
knowledge with task-specific patterns.
LoRA freezes all pretrained weights and introduces two small
adapter matrices alongside each targeted attention projection:

- Original matrix W: 768 × 768 = 589,824 parameters
- LoRA matrix A: 768 × 8 = 6,144 parameters
- LoRA matrix B: 8 × 768 = 6,144 parameters
- Reduction: A and B together hold 12,288 parameters, about 97.9% fewer than W; model-wide, 98.8% fewer trainable parameters

**Result:**
- Total parameters: 67,012,685
- Trainable parameters: 797,261 (1.17%)
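
The trainable-parameter count can be reproduced with a little arithmetic. The sketch below rests on assumptions that are not spelled out in this README: DistilBERT's 6 transformer layers, LoRA applied to `q_lin` and `v_lin` in every layer, and a classification head (a 768 × 768 `pre_classifier` plus a 768 × 77 `classifier`, both with biases) that stays trainable. Under those assumptions the numbers line up with the reported figure.

```python
# Back-of-the-envelope count of trainable parameters.
# All structural constants below are assumptions about the setup.
HIDDEN = 768        # DistilBERT hidden size
RANK = 8            # LoRA rank r
NUM_LAYERS = 6      # transformer layers in distilbert-base-uncased
NUM_TARGETS = 2     # q_lin and v_lin in each layer
NUM_CLASSES = 77    # Banking77 intents

full_matrix = HIDDEN * HIDDEN                   # one frozen projection W
adapter_pair = HIDDEN * RANK + RANK * HIDDEN    # A and B for one projection

lora_params = NUM_LAYERS * NUM_TARGETS * adapter_pair

# Assumed trainable head: pre_classifier and classifier, weights plus biases.
head_params = (HIDDEN * HIDDEN + HIDDEN) + (HIDDEN * NUM_CLASSES + NUM_CLASSES)

trainable = lora_params + head_params

print(f"one projection W:     {full_matrix:,}")
print(f"one adapter pair A+B: {adapter_pair:,}")
print(f"all LoRA adapters:    {lora_params:,}")
print(f"classification head:  {head_params:,}")
print(f"trainable total:      {trainable:,}")
```

Most of the trainable budget in this breakdown sits in the classification head rather than in the adapters, which is worth keeping in mind when comparing ranks.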
### LoRA Configuration
```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=8,                                # rank: sweet spot for this task
    lora_alpha=16,                      # scaling factor (2x rank)
    target_modules=["q_lin", "v_lin"],  # query and value attention matrices
    lora_dropout=0.1,                   # regularization
    task_type=TaskType.SEQ_CLS,         # sequence classification
)
```
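
To make the roles of `r` and `lora_alpha` concrete, here is a minimal pure-Python sketch of a LoRA-adapted linear map, `h = W x + (alpha / r) * B (A x)`. It is an illustration with toy sizes, not the PEFT implementation; the helper names `matvec` and `lora_forward` are hypothetical. It follows the common convention where A is r × d and B is d × r, which has the same parameter count as the shapes listed earlier.

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, r, alpha):
    """Compute W x + (alpha / r) * B (A x); W is treated as frozen."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

random.seed(0)
d, r, alpha = 6, 2, 4   # toy sizes; the config above uses d=768, r=8, alpha=16

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]      # frozen weights
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]   # trainable, r x d
B = [[0.0] * r for _ in range(d)]                                   # trainable, d x r, zero init
x = [random.gauss(0, 1) for _ in range(d)]

# With B at zero, the adapter contributes nothing yet.
h0 = lora_forward(W, A, B, x, r, alpha)

# Simulate one "learned" entry in B: only output coordinate 0 can change.
B[0][0] = 1.0
h1 = lora_forward(W, A, B, x, r, alpha)
```

Because B starts at zero, the adapted model initially behaves exactly like the pretrained one, and the ratio `alpha / r` (2 in the configuration above) controls how strongly the learned update is applied.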
### Handling Class Imbalance
Data exploration revealed a 5.34x imbalance between the most and
least frequent classes (187 vs 35 examples). Without
correction, training can lead the model to neglect rare intents.
**Fix: Inverse frequency weighted loss**
```python
class_weights = 1.0 / label_counts
# Rare class → high weight → misclassifying it costs more
# Common class → low weight → model cannot ignore rare classes
```
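
As a rough illustration of how inverse-frequency weights change the loss, the following standard-library sketch computes weights from a toy label list and applies them to a per-example cross-entropy. The function names and the normalisation to a mean weight of 1 are assumptions made for this example; the toy counts mirror the 187 vs 35 extremes reported above.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Return {class: weight} with weight proportional to 1 / count."""
    counts = Counter(labels)
    raw = {c: 1.0 / n for c, n in counts.items()}
    mean = sum(raw.values()) / len(raw)
    # Normalising to mean 1 is an assumption; it keeps the loss scale familiar.
    return {c: w / mean for c, w in raw.items()}

def weighted_cross_entropy(probs, label, weights):
    """Loss for one example: -weight[label] * log p(label)."""
    return -weights[label] * math.log(probs[label])

# Toy data: one frequent and one rare class.
labels = ["common"] * 187 + ["rare"] * 35
weights = inverse_frequency_weights(labels)
ratio = weights["rare"] / weights["common"]

# Same predicted probability for the true class, different cost.
probs = {"common": 0.5, "rare": 0.5}
loss_common = weighted_cross_entropy(probs, "common", weights)
loss_rare = weighted_cross_entropy(probs, "rare", weights)

print(f"rare/common weight ratio: {ratio:.2f}")
```

In PyTorch the same idea is usually expressed by passing a weight tensor to `torch.nn.CrossEntropyLoss`; how the notebook wires this into the training loop is not shown in this README.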
---
## Training Configuration
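```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",          # hypothetical path; not given in the original
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=2e-5,              # small, conservative step size
    warmup_steps=100,                # gradual lr increase at start
    weight_decay=0.01,               # regularization
    fp16=True,                       # mixed precision: lower memory use on GPU
    eval_strategy="epoch",
    save_strategy="epoch",           # assumed; must match eval_strategy below
    load_best_model_at_end=True,
)
```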
**Why learning rate 2e-5?**
Large learning rates can destabilize fine-tuning. Because LoRA
keeps the pretrained weights frozen, the rate only governs the
adapter matrices and the classification head; 2e-5 nudges them
toward the task without disturbing what DistilBERT learned
during pretraining.

---
## Results
### Training Curve
| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 3.9726 | 3.5859 | 38.76% |
| 2 | 2.5550 | 2.2843 | 61.14% |
| 3 | 1.9706 | 1.7091 | 68.63% |
| 4 | 1.6654 | 1.4714 | 71.03% |
| 5 | 1.5524 | 1.4026 | 71.73% |
**Final Test Accuracy: 72.69%**
Baseline (random): 1.3%. The model achieves a 56x improvement over random guessing.
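
The baseline comparison is simple arithmetic, sketched here under the assumption that "random" means guessing uniformly among the 77 intents:

```python
num_classes = 77
test_accuracy = 0.7269                  # final test accuracy reported above

random_baseline = 1 / num_classes       # uniform guessing
improvement = test_accuracy / random_baseline

print(f"random baseline: {random_baseline:.1%}")
print(f"improvement:     {improvement:.0f}x")
```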
### Per-Class Performance
**Top 5 Best Performing Intents:**
| Intent | F1 Score |
|---|---|
| verify_top_up | 1.000 |
| age_limit | 0.976 |
| passcode_forgotten | 0.941 |
| edit_personal_details | 0.940 |
| get_physical_card | 0.925 |
**Top 5 Worst Performing Intents:**
| Intent | F1 Score |
|---|---|
| topping_up_by_card | 0.170 |
| why_verify_identity | 0.333 |
| request_refund | 0.353 |
| supported_cards_and_currencies | 0.353 |
| top_up_by_bank_transfer_charge | 0.370 |
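
For reference, a per-class F1 score is the harmonic mean of precision and recall for a single intent. The sketch below shows the calculation in plain Python on made-up toy labels (not project data); the function name `per_class_f1` is hypothetical, and the notebook itself presumably relies on sklearn for this.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating it as positive and all others as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels for illustration only.
y_true = ["change_pin", "change_pin", "request_refund", "request_refund", "age_limit"]
y_pred = ["change_pin", "change_pin", "change_pin", "request_refund", "age_limit"]

scores = {
    label: per_class_f1(y_true, y_pred, label)
    for label in sorted(set(y_true))
}
for label, score in scores.items():
    print(f"{label}: {score:.3f}")
```

In this toy case one `request_refund` query is absorbed by `change_pin`, which lowers the F1 of both classes: recall drops for the first and precision drops for the second.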
### Failure Mode Analysis
Poorly performing intents share a common pattern: semantic
overlap. A customer saying "I want to add money to my card"
could plausibly belong to:
- `topping_up_by_card`
- `top_up_by_bank_transfer_charge`
- `top_up_by_cash_or_cheque`
- `transfer_into_account`

The model distributes confidence across all similar intents
rather than making one strong prediction. This is partly a dataset
limitation: Banking77's 77 classes contain genuinely
ambiguous boundaries that even human annotators can struggle with.

**This explains why confidence scores are moderate:**
- "I want to change my PIN" → `change_pin` (47.46%): clear intent
- "I lost my card" → `card_not_working` (14.40%): ambiguous

---
## Inference Demo
```python
test_queries = [
"I lost my card and need a replacement",
"Why was my payment declined?",
"How do I add money to my account?",
"I want to change my PIN number",
"What currencies do you support?"
]
```
**Results:**

| Text | Intent | Confidence |
|---|---|---|
| Why was my payment declined? | declined_card_payment | 19.94% |
| I want to change my PIN number | change_pin | 47.46% |
| What currencies do you support? | fiat_currency_support | 35.15% |

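
These confidence values are presumably the softmax probability of the top class, although the README does not state how they are computed. Under that assumption, the sketch below shows why near-tied logits among similar intents yield a low top-class confidence even when the prediction is correct. The logits are hypothetical, not model output.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate intents.
clear = softmax([4.0, 1.0, 1.0, 1.0])        # one intent stands out
ambiguous = softmax([2.0, 1.9, 1.8, 1.7])    # several intents nearly tied

print(f"clear top confidence:     {max(clear):.2%}")
print(f"ambiguous top confidence: {max(ambiguous):.2%}")
```

With 77 classes the probability mass is spread even thinner, so a moderate confidence does not by itself mean the top prediction is wrong.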
---
## What I Would Improve With More Resources
1. **Larger rank r**: r=16 or r=32 would give the model more
   capacity to learn complex intent boundaries
2. **More data for rare classes**: data augmentation or
   synthetic generation for intents with only 35 examples
3. **Intent merging**: semantically overlapping intents
   like `topping_up_by_card` and `top_up_by_bank_transfer_charge`
   could be merged into parent categories
4. **Larger base model**: RoBERTa-base or DeBERTa would
   likely push accuracy above 80%
5. **Contrastive learning**: train the model to explicitly
   push similar intents apart in embedding space
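
The intent-merging idea from item 3 could be prototyped with a plain lookup table applied to labels or predictions. The parent category names and the grouping below are hypothetical, chosen only to illustrate the mechanism:

```python
# Hypothetical parent categories; this grouping is illustrative only.
PARENT_INTENT = {
    "topping_up_by_card": "top_up",
    "top_up_by_bank_transfer_charge": "top_up",
    "verify_top_up": "top_up",
    "change_pin": "pin_and_passcode",
    "passcode_forgotten": "pin_and_passcode",
}

def merge_intent(intent):
    """Map a fine-grained intent to its parent; unmapped intents pass through."""
    return PARENT_INTENT.get(intent, intent)

fine_predictions = [
    "topping_up_by_card",
    "top_up_by_bank_transfer_charge",
    "age_limit",
]
coarse_predictions = [merge_intent(intent) for intent in fine_predictions]
print(coarse_predictions)
```

Applying the same mapping to both predictions and ground-truth labels would show how much accuracy is lost purely to confusion within a parent category, before committing to retraining on merged classes.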
---
## Stack
| Component | Tool |
|---|---|
| Base Model | distilbert-base-uncased |
| Fine-tuning | HuggingFace PEFT + LoRA |
| Training | PyTorch + HuggingFace Trainer |
| Dataset | HuggingFace Datasets |
| Evaluation | sklearn + evaluate |
| Compute | Google Colab T4 GPU (free) |
---
## Project Structure
`banking77-intent-classifier/`
- `notebook.ipynb`: full pipeline (data → train → eval → inference)
- `README.md`: this file

---
## Author
**Syed Muhammad Aneeb Ur Rehman**
AI/ML Engineer | Full-Stack Developer
[LinkedIn](https://linkedin.com/in/syedaneeb15) |
[GitHub](https://github.com/aneebnaqvi15)