	# Banking77 Intent Classifier — DistilBERT + LoRA Fine-Tuning

	![Python](https://img.shields.io/badge/Python-3.10-blue)
	![PyTorch](https://img.shields.io/badge/PyTorch-2.0-orange)
	![HuggingFace](https://img.shields.io/badge/HuggingFace-Transformers-yellow)
	![LoRA](https://img.shields.io/badge/PEFT-LoRA-green)

	## Problem Statement

	Customer support systems receive thousands of messages daily.
	Manually routing each message to the correct department is
	slow, expensive, and error-prone. This project builds an
	automated intent classifier that categorizes customer banking
	queries into 77 distinct intents — enabling instant, accurate
	routing without human intervention.

	Real-world challenge: How do you fine-tune a transformer
	model for production use when you have no GPU of your own,
	no budget for paid APIs, and limited compute resources?

	Solution: Parameter-Efficient Fine-Tuning (PEFT) using LoRA,
	training only 1.17% of model parameters while the pretrained
	weights stay frozen.

	---

	## Dataset

	Banking77 — Industry standard customer support benchmark

	| Property | Value |
	|---|---|
	| Training examples | 10,003 |
	| Test examples | 3,080 |
	| Intent classes | 77 |
	| Average text length | 59.5 characters |
	| Class imbalance ratio | 5.34x |

	Sample intents: `lost_or_stolen_card`, `declined_card_payment`,
	`change_pin`, `topping_up_by_card`, `fiat_currency_support`
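
The average length and imbalance ratio in the table can be recomputed from the raw examples. The sketch below assumes the training split is available as a list of `(text, label)` pairs; `dataset_stats` and the toy `rows` are hypothetical names used only for illustration.

```python
from collections import Counter


def dataset_stats(rows):
    """Return (average text length in characters, imbalance ratio).

    rows: iterable of (text, label) pairs. The imbalance ratio is the
    size of the largest class divided by the size of the smallest.
    """
    rows = list(rows)
    avg_len = sum(len(text) for text, _ in rows) / len(rows)
    counts = Counter(label for _, label in rows)
    ratio = max(counts.values()) / min(counts.values())
    return avg_len, ratio


# Toy data, not taken from Banking77.
rows = [
    ("I lost my card", "lost_or_stolen_card"),
    ("My card was stolen", "lost_or_stolen_card"),
    ("Change my PIN", "change_pin"),
]
avg_len, ratio = dataset_stats(rows)
print(f"average length: {avg_len:.1f} chars, imbalance: {ratio:.2f}x")
```

With the class counts reported later in this README (187 and 35 examples), the same ratio is 187 / 35 ≈ 5.34.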

	---

	## Architecture & Technical Decisions

	### Why DistilBERT over BERT-base?

	| Model | Parameters | Relative performance | Size |
	|---|---|---|---|
	| BERT-base | 110M | 100% | Baseline |
	| DistilBERT | 66M | 97% | 40% smaller |

	DistilBERT retains 97% of BERT's performance through knowledge
	distillation while using 40% fewer parameters. That smaller
	footprint matters when training on a free Colab T4 GPU, where
	memory is limited.

	### Why LoRA?

	Full fine-tuning of 66M parameters is expensive and risks
	catastrophic forgetting — where the model overwrites pretrained
	knowledge with task-specific patterns.

	LoRA freezes all pretrained weights and introduces two small
	adapter matrices alongside each targeted attention projection
	(here, the query and value matrices):

	- Original matrix W: 768 × 768 = 589,824 parameters (frozen)
	- LoRA matrix A: 768 × 8 = 6,144 parameters
	- LoRA matrix B: 8 × 768 = 6,144 parameters
	- Reduction: 98.8% fewer trainable parameters across the whole model

	Result:

	- Total parameters: 67,012,685
	- Trainable parameters: 797,261 (1.17%)
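
The trainable-parameter count can be sanity-checked by hand. The sketch below is one accounting that reproduces the reported 797,261. It assumes DistilBERT's 6 transformer layers and hidden size of 768, and it assumes that the classification head (a 768 × 768 `pre_classifier` layer and a 768 × 77 `classifier` layer, both with biases) is trained alongside the adapters. The README does not state that second assumption, so treat it as an inference from the numbers.

```python
HIDDEN = 768      # DistilBERT hidden size
RANK = 8          # LoRA rank r
LAYERS = 6        # transformer layers in DistilBERT
TARGETS = 2       # q_lin and v_lin in each layer
NUM_CLASSES = 77

# Each adapted projection gets A (768 x 8) and B (8 x 768).
per_projection = HIDDEN * RANK + RANK * HIDDEN
adapter_params = per_projection * LAYERS * TARGETS

# Assumed trainable classification head: weights plus biases.
pre_classifier = HIDDEN * HIDDEN + HIDDEN
classifier = HIDDEN * NUM_CLASSES + NUM_CLASSES

trainable = adapter_params + pre_classifier + classifier
print(f"adapters: {adapter_params:,}")
print(f"head:     {pre_classifier + classifier:,}")
print(f"total:    {trainable:,}")
```

Under these assumptions the adapters account for 147,456 parameters and the head for the remaining 649,805, so most of the trainable weights would sit in the classification head rather than in the LoRA matrices.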

	### LoRA Configuration

	```python
	from peft import LoraConfig, TaskType

	LoraConfig(
	    r=8,                                # rank: sweet spot for this task
	    lora_alpha=16,                      # scaling factor (2x rank)
	    target_modules=["q_lin", "v_lin"],  # query and value attention matrices
	    lora_dropout=0.1,                   # regularization
	    task_type=TaskType.SEQ_CLS,         # sequence classification
	)
	```
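
To make the roles of `r` and `lora_alpha` concrete, here is a dependency-free sketch of a LoRA forward pass on a toy layer. A maps the input down to rank `r`, B maps it back up, and the update is scaled by `lora_alpha / r` (16 / 8 = 2 with the configuration above). The dimensions, helper names, and initialization values are illustrative assumptions; PEFT's actual implementation differs in detail.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, lora_alpha):
    """Frozen projection plus a scaled low-rank update.

    W: d_out x d_in frozen weights
    A: r x d_in down-projection (trainable)
    B: d_out x r up-projection (trainable)
    """
    scaling = lora_alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


random.seed(0)
d, r, lora_alpha = 4, 2, 4  # toy sizes; the README uses d=768, r=8, alpha=16

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]  # zero init: the adapter starts as a no-op
x = [1.0, -2.0, 0.5, 3.0]

# With B at zero the adapted layer reproduces the frozen layer exactly.
assert lora_forward(W, A, B, x, r, lora_alpha) == matvec(W, x)

# After B moves away from zero, the output shifts by scaling * B(Ax).
B[0][0] = 1.0
shifted = lora_forward(W, A, B, x, r, lora_alpha)
print(shifted[0] - matvec(W, x)[0])
```

Only A and B would receive gradients during training; W is never modified, which is why the pretrained knowledge is preserved.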

	### Handling Class Imbalance

	Data exploration revealed a 5.34x imbalance between the most
	and least frequent classes (187 vs 35 examples). Without
	correction, training can push the model to under-predict rare
	intents.

	Fix: Inverse frequency weighted loss

	```python
	# label_counts: per-class example counts from the training split
	class_weights = 1.0 / label_counts
	# Rare class → high weight → misclassifying it costs more
	# Common class → low weight → it cannot dominate the loss
	```
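
As a rough illustration of how inverse-frequency weights change the loss, the sketch below implements a weighted cross-entropy for a single example in plain Python. In a PyTorch training loop the same weights would typically be passed to `torch.nn.CrossEntropyLoss(weight=...)`. How the notebook wires this into the `Trainer` is not shown in this README, so the helper names here are hypothetical.

```python
import math


def inverse_frequency_weights(label_counts):
    """Weight each class by 1 / count, as in the snippet above."""
    return [1.0 / count for count in label_counts]


def weighted_cross_entropy(logits, target, class_weights):
    """Cross-entropy for one example, scaled by the target class weight."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )
    log_prob = logits[target] - log_sum_exp
    return -class_weights[target] * log_prob


# Two classes with the extreme counts reported above: 187 vs 35 examples.
weights = inverse_frequency_weights([187, 35])
logits = [0.0, 0.0]  # an undecided model: equal scores for both classes

loss_common = weighted_cross_entropy(logits, target=0, class_weights=weights)
loss_rare = weighted_cross_entropy(logits, target=1, class_weights=weights)
print(f"rare/common loss ratio: {loss_rare / loss_common:.2f}")
```

For the same prediction, a mistake on the rarest class costs about 5.34 times more than one on the most common class. Weights are often rescaled (for example, to a mean of 1) so the overall loss magnitude stays comparable; whether the notebook does this is not stated.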

	---

	## Training Configuration

	```python
	from transformers import TrainingArguments

	TrainingArguments(
	    output_dir="./results",          # placeholder path (assumption)
	    num_train_epochs=5,
	    per_device_train_batch_size=32,
	    learning_rate=2e-5,              # conservative step size
	    warmup_steps=100,                # gradual lr increase at start
	    weight_decay=0.01,               # regularization
	    fp16=True,                       # mixed precision: lower memory use
	    eval_strategy="epoch",
	    save_strategy="epoch",           # must match eval_strategy for load_best_model_at_end
	    load_best_model_at_end=True,
	)
	```

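	Why learning rate 2e-5?
	Large learning rates can destabilize training. With LoRA the
	pretrained DistilBERT weights are frozen, so the learning rate
	only governs the trainable parameters. 2e-5 is a conservative
	value carried over from full fine-tuning; LoRA setups often use
	larger values, so this is a candidate for tuning.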
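
For intuition about `warmup_steps=100`, the sketch below reproduces the shape of a linear warmup followed by linear decay, which is the default scheduler type for `Trainer`. The step count assumes a single device, no gradient accumulation, and 10,003 training examples at batch size 32; `lr_at_step` is a hypothetical helper, not a library function.

```python
import math

BASE_LR = 2e-5
WARMUP_STEPS = 100
STEPS_PER_EPOCH = math.ceil(10_003 / 32)  # 313 optimizer steps per epoch
TOTAL_STEPS = STEPS_PER_EPOCH * 5         # 1,565 steps over 5 epochs


def lr_at_step(step):
    """Linear warmup to BASE_LR, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    remaining = TOTAL_STEPS - step
    return BASE_LR * max(0.0, remaining / (TOTAL_STEPS - WARMUP_STEPS))


for step in (0, 50, 100, 800, TOTAL_STEPS):
    print(f"step {step:>4}: lr = {lr_at_step(step):.2e}")
```

Under these assumptions, warmup covers roughly the first third of epoch 1 (100 of 313 steps), and the learning rate reaches zero at the final step.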

	---

	## Results

	### Training Curve

	| Epoch | Training Loss | Validation Loss | Validation Accuracy |
	|---|---|---|---|
	| 1 | 3.9726 | 3.5859 | 38.76% |
	| 2 | 2.5550 | 2.2843 | 61.14% |
	| 3 | 1.9706 | 1.7091 | 68.63% |
	| 4 | 1.6654 | 1.4714 | 71.03% |
	| 5 | 1.5524 | 1.4026 | 71.73% |

	Final Test Accuracy: 72.69%

	Random-guess baseline: 1/77 ≈ 1.3%, so the model is roughly 56x better than chance.

	### Per-Class Performance

	Top 5 Best Performing Intents:

	| Intent | F1 Score |
	|---|---|
	| verify_top_up | 1.000 |
	| age_limit | 0.976 |
	| passcode_forgotten | 0.941 |
	| edit_personal_details | 0.940 |
	| get_physical_card | 0.925 |

	5 Worst-Performing Intents:

	| Intent | F1 Score |
	|---|---|
	| topping_up_by_card | 0.170 |
	| why_verify_identity | 0.333 |
	| request_refund | 0.353 |
	| supported_cards_and_currencies | 0.353 |
	| top_up_by_bank_transfer_charge | 0.370 |
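
Per-class F1 is the harmonic mean of precision and recall, computed separately for each intent. With sklearn (listed in the Stack section), `sklearn.metrics.classification_report` or `f1_score(..., average=None)` would produce such numbers; which call the notebook uses is not stated. The dependency-free sketch below shows the same calculation on toy labels, with an illustrative function name.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed one-vs-rest for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1  # predicted this label, but it was wrong
            fn[true] += 1  # missed an example of the true label
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy labels, not real model output.
y_true = ["change_pin", "change_pin", "request_refund", "request_refund"]
y_pred = ["change_pin", "change_pin", "change_pin", "request_refund"]

for label, f1 in per_class_f1(y_true, y_pred).items():
    print(f"{label}: {f1:.3f}")
```

In the toy data, `request_refund` loses recall because one of its examples is absorbed by `change_pin`, which in turn loses precision. The same mechanism drives the low scores for overlapping intents in the table above.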

	### Failure Mode Analysis

	Poorly performing intents share a common pattern: semantic
	overlap. A query such as "I want to add money to my card"
	could plausibly be assigned to any of:

	- `topping_up_by_card`
	- `top_up_by_bank_transfer_charge`
	- `top_up_by_cash_or_cheque`
	- `transfer_into_account`

	The model distributes confidence across all similar intents
	rather than making one strong prediction. This is partly a
	dataset limitation: some of Banking77's 77 classes have genuinely
	ambiguous boundaries that even human annotators struggle with.
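
The effect of semantic overlap on confidence can be illustrated with a softmax over made-up logits: when several related intents receive similar scores, the top probability stays moderate even if the top label is correct. The logits below are invented for illustration and are not model outputs.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(labels, logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


labels = [
    "topping_up_by_card",
    "top_up_by_bank_transfer_charge",
    "transfer_into_account",
    "change_pin",
]

# One intent clearly ahead of the rest.
clear_label, clear_prob = top_prediction(labels, [0.0, 0.0, 0.0, 4.0])
# Three overlapping intents with near-identical scores.
fuzzy_label, fuzzy_prob = top_prediction(labels, [2.1, 2.0, 1.9, 0.0])

print(f"{clear_label}: {clear_prob:.2f}")
print(f"{fuzzy_label}: {fuzzy_prob:.2f}")
```

In the second case the winning intent gets only about a third of the probability mass, because its two near-neighbors take almost as much.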

	This explains why confidence scores are moderate:

	- "I want to change my PIN" → `change_pin` (47.46%): clear intent
	- "I lost my card" → `card_not_working` (14.40%): ambiguous

	---

	## Inference Demo

	```python
	test_queries = [
	    "I lost my card and need a replacement",
	    "Why was my payment declined?",
	    "How do I add money to my account?",
	    "I want to change my PIN number",
	    "What currencies do you support?",
	]
	```

	Results:

	| Text | Intent | Confidence |
	|---|---|---|
	| Why was my payment declined? | declined_card_payment | 19.94% |
	| I want to change my PIN number | change_pin | 47.46% |
	| What currencies do you support? | fiat_currency_support | 35.15% |

	---

	## What I Would Improve With More Resources

	1. Larger rank r: r=16 or r=32 would give the model more
	capacity to learn complex intent boundaries

	2. More data for rare classes: data augmentation or
	synthetic generation for intents with as few as 35 examples

	3. Intent merging: semantically overlapping intents
	like `topping_up_by_card` and `top_up_by_bank_transfer_charge`
	could be merged into parent categories

	4. Larger base model: RoBERTa-base or DeBERTa would
	likely push accuracy above 80%

	5. Contrastive learning: train the model to explicitly
	push similar intents apart in embedding space
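
Item 3 can be prototyped without retraining by mapping fine-grained predictions to coarser parent categories and re-scoring. The sketch below assumes a hand-written mapping; the parent name `top_up` and the toy predictions are hypothetical, and whether merging is acceptable depends on how downstream routing uses the intents.

```python
# Hypothetical mapping from Banking77 intents to coarser parent categories.
PARENT = {
    "topping_up_by_card": "top_up",
    "top_up_by_bank_transfer_charge": "top_up",
}


def to_parent(label):
    """Map a fine-grained intent to its parent; unmapped intents are unchanged."""
    return PARENT.get(label, label)


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the reference."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels, not real model output.
y_true = ["topping_up_by_card", "top_up_by_bank_transfer_charge",
          "change_pin", "change_pin"]
y_pred = ["top_up_by_bank_transfer_charge", "topping_up_by_card",
          "change_pin", "request_refund"]

fine = accuracy(y_true, y_pred)
coarse = accuracy([to_parent(t) for t in y_true],
                  [to_parent(p) for p in y_pred])
print(f"fine-grained: {fine:.2f}, merged: {coarse:.2f}")
```

Confusions between the two merged intents stop counting as errors, while unrelated mistakes (the last toy example) are unaffected.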

	---

	## Stack

	| Component | Tool |
	|---|---|
	| Base Model | distilbert-base-uncased |
	| Fine-tuning | HuggingFace PEFT + LoRA |
	| Training | PyTorch + HuggingFace Trainer |
	| Dataset | HuggingFace Datasets |
	| Evaluation | sklearn + evaluate |
	| Compute | Google Colab T4 GPU (free) |

	---

	## Project Structure
	banking77-intent-classifier/
	├── notebook.ipynb # Full pipeline: data → train → eval → inference
	└── README.md # This file

	---

	## Author

	Syed Muhammad Aneeb Ur Rehman
	AI/ML Engineer | Full-Stack Developer
	[LinkedIn](https://linkedin.com/in/syedaneeb15) |
	[GitHub](https://github.com/aneebnaqvi15)