# Expense Tracker – Indian SMS Classifier

A fine-tuned `distilbert-base-uncased` model that classifies Indian bank/UPI/payment SMS messages
into expense categories. Built with LoRA (Low-Rank Adaptation) for better generalization
on small datasets.
## Categories
| ID | Category | Description | Example SMS |
|---|---|---|---|
| 0 | Bills | Utility bills, subscriptions, EMI | "Electricity bill Rs.1340 paid" |
| 1 | Expense | Generic bank debits, UPI transfers | "A/c debited by Rs.1530. Bal Rs.3303." |
| 2 | Food | Food delivery, restaurant orders | "Food order Rs.345 confirmed. Delivery 20 mins." |
| 3 | Income | Credits, salary, refunds, cashback | "Rs.45000 credited. Salary for March." |
| 4 | Recharge | Mobile/DTH recharge | "Rs.239 recharged. Validity 28 days." |
| 5 | Transport | Rides, flights, trains, toll | "Ride completed. Rs.234 charged." |
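For reference, the ID-to-label mapping from the table as a plain dict (assumed to mirror the checkpoint's `id2label` in `config.json`; verify against the downloaded model):

```python
# Label map as listed in the table above; assumed to match the model's
# config.json id2label. Verify against the actual checkpoint.
ID2LABEL = {
    0: "Bills",
    1: "Expense",
    2: "Food",
    3: "Income",
    4: "Recharge",
    5: "Transport",
}
LABEL2ID = {v: k for k, v in ID2LABEL.items()}
```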
## Quick Start
```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="udayugale/expense-tracker-distilbert-v3",
)

# Single SMS
result = clf("Food order Rs.345 confirmed. Delivery in 20 mins. Enjoy your meal!")
print(result)
# [{'label': 'Food', 'score': 0.97}]

# Batch prediction
sms_list = [
    "A/c XX5274 debited by Rs. 1530. Total Bal Rs. 3303 CR.",
    "Rs698 recharged! Enjoy Unlimited Calls. Valid 28 days.",
    "Your ride has ended. Total fare Rs.234. Thanks for riding.",
    "Rs.45000 credited to your account. Salary for March.",
]
results = clf(sms_list)
for sms, r in zip(sms_list, results):
    print(f"{r['label']:>10} ({r['score']:.2f})  {sms[:55]}")
```
## Full 3-Layer Pipeline (Recommended)

The model alone gives you the category. For production use, combine it with amount extraction and app detection:
```python
import json
import re

from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="udayugale/expense-tracker-distilbert-v3",
)

def clean_sms(text):
    """Same cleaning used during training; must match exactly."""
    t = text.lower()
    t = re.sub(r'https?://\S+', 'URL', t)
    t = re.sub(r'inr|rs\.?|₹', 'rs ', t)
    t = re.sub(r'a/c\s*(?:xx|\*+)?\d+', 'ACNO', t, flags=re.IGNORECASE)
    t = re.sub(r'\b\d{8,}\b', 'REFNO', t)
    t = re.sub(r'[^\x00-\x7F]+', ' ', t)
    return re.sub(r'\s+', ' ', t).strip()

APP_PATTERNS = [
    ("Swiggy", r"\bswiggy\b"), ("Zomato", r"\bzomato\b"),
    ("Blinkit", r"\bblinkit\b"), ("Zepto", r"\bzepto\b"),
    ("Uber", r"\buber\b"), ("Ola", r"\bola\s+ride\b|olacab"),
    ("Rapido", r"\brapido\b"), ("IRCTC", r"\birctc\b"),
    ("FASTag", r"\bfastag\b"), ("Amazon", r"\bamazon\b"),
    ("Flipkart", r"\bflipkart\b"), ("Netflix", r"\bnetflix\b"),
    ("Jio", r"\bjio\b"), ("Airtel", r"\bairtel\b"),
    ("PhonePe", r"\bphonepe\b"), ("Paytm", r"\bpaytm\b"),
    ("HDFC", r"\bhdfc\b"), ("SBI", r"\bsbi\b"),
    ("ICICI", r"\bicici\b"), ("Kotak", r"\bkotak\b"),
    # Add more as new apps emerge
]

def analyze_sms(raw_text, sender=None):
    """
    Full analysis: category + amount + transaction type + app.
    Returns None for app if unknown; never forces a wrong answer.
    """
    # Layer 1: ML classification
    ml = clf(clean_sms(raw_text))[0]

    # Layer 2: Amount extraction
    amounts = [
        float(a.replace(",", ""))
        for a in re.findall(
            r"(?:rs\.?\s*|inr\s*|₹\s*)(\d[\d,]*(?:\.\d{1,2})?)",
            raw_text, re.IGNORECASE
        )
    ]
    txn_type = (
        "credit" if re.search(r"\bcredited\b|\breceived\b|\bsalary\b", raw_text, re.I)
        else "debit" if re.search(r"\bdebited\b|\bsent\b|\bpaid\b|\bcharged\b", raw_text, re.I)
        else "recharge" if re.search(r"\brecharged\b", raw_text, re.I)
        else "unknown"
    )
    bal = re.search(
        r"(?:total bal|avl bal|balance)[:\s]*(?:rs\.?\s*|₹)?([\d,]+(?:\.\d{1,2})?)",
        raw_text, re.I
    )

    # Layer 3: App detection (None if unknown, not forced)
    app = None
    for name, pat in APP_PATTERNS:
        if re.search(pat, raw_text, re.IGNORECASE):
            app = name
            break

    return {
        "category": ml["label"],
        "confidence": round(ml["score"], 4),
        "amount": amounts[0] if amounts else None,
        "type": txn_type,
        "balance": float(bal.group(1).replace(",", "")) if bal else None,
        "app": app,  # None = unknown app; model still classified correctly
    }

# Example
result = analyze_sms(
    "A/c XX5274 credited by Rs. 350.00 via UPI from RAHUL VILAS",
    sender="AD-CENTBK-T"
)
print(json.dumps(result, indent=2))
# {
#   "category": "Income",
#   "confidence": 0.9734,
#   "amount": 350.0,
#   "type": "credit",
#   "balance": null,
#   "app": null
# }
```
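Because inference quality depends on the cleaning matching training exactly, it is worth sanity-checking the normalization on a sample message. A self-contained copy of the `clean_sms` logic, with its traced output:

```python
import re

def clean_sms(text):
    # Mirrors the training-time cleaning: lowercase, mask URLs, normalize
    # currency tokens, mask account and reference numbers, strip non-ASCII,
    # and collapse whitespace.
    t = text.lower()
    t = re.sub(r'https?://\S+', 'URL', t)
    t = re.sub(r'inr|rs\.?|₹', 'rs ', t)
    t = re.sub(r'a/c\s*(?:xx|\*+)?\d+', 'ACNO', t, flags=re.IGNORECASE)
    t = re.sub(r'\b\d{8,}\b', 'REFNO', t)
    t = re.sub(r'[^\x00-\x7F]+', ' ', t)
    return re.sub(r'\s+', ' ', t).strip()

print(clean_sms("A/c XX5274 debited by Rs. 1530 via UPI Ref 123456789"))
# ACNO debited by rs 1530 via upi ref REFNO
```

Note how the account number, reference number, and currency marker are all replaced with stable tokens the model saw during training.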
## Training Details

### Model Architecture
| Setting | Value |
|---|---|
| Base model | distilbert-base-uncased |
| Method | LoRA (Low-Rank Adaptation), not QLoRA |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| LoRA target layers | q_lin, k_lin, v_lin, out_lin |
| Trainable parameters | ~1.2M (1.8% of total 66M) |
| Frozen parameters | ~64.8M (base DistilBERT) |
### Training Configuration
| Setting | Value |
|---|---|
| Epochs | 12 |
| Learning rate | 3e-4 (higher than standard; correct for LoRA) |
| Batch size | 32 |
| LR scheduler | Cosine decay |
| Warmup ratio | 0.06 |
| Weight decay | 0.01 |
| Loss function | Weighted CrossEntropy (minority classes weighted higher) |
| Max sequence length | 128 tokens |
| Optimizer | AdamW |
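The "minority classes weighted higher" loss typically uses inverse class-frequency weights passed to `CrossEntropyLoss(weight=...)`. A minimal sketch with hypothetical per-class counts (the card does not publish the actual distribution):

```python
from collections import Counter

# Hypothetical per-class row counts, for illustration only.
counts = Counter({"Bills": 900, "Expense": 3000, "Food": 1200,
                  "Income": 800, "Recharge": 700, "Transport": 1000})
total = sum(counts.values())
k = len(counts)

# Inverse-frequency weighting: minority classes get weight > 1,
# majority classes < 1. These weights would feed CrossEntropyLoss(weight=...).
weights = {c: total / (k * n) for c, n in counts.items()}
```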
### Why LoRA (not QLoRA)
QLoRA requires bitsandbytes 4-bit quantization, which is incompatible with
DistilBERT's encoder architecture (dtype conflicts between the uint8 base and fp32
classification head). LoRA gives the same accuracy improvement; the gain
comes from training fewer parameters, not from quantization.
## Training Data

| Source | Type | Rows Used |
|---|---|---|
| `merged_final_dataset.csv` (real Indian SMS) | Real SMS from 93 users | ~4,200 |
| `engreemali/bank-transactions-sms-datasetss` (Kaggle) | Real Indian SMS (100K) | ~1,200 |
| `kumarperiya/pan-indian-consumer-transaction-dataset` (Kaggle) | Structured data synthesized into SMS | ~600 |
| `realistic_synthetic_sms.csv` (ChatGPT generated) | Synthetic SMS | ~3,200 |
| Pattern templates (programmatic) | Language pattern augmentation | ~1,400 |
### Data Priority (highest to lowest)

1. Real SMS from `merged_final_dataset.csv`
2. Real SMS from the Kaggle `engreemali` dataset
3. Synthesized SMS from the Kaggle `kumarperiya` dataset
4. ChatGPT synthetic data (fixed: dropped Income/Others, removed bad sender column)
5. Programmatic pattern templates
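The priority order can be enforced mechanically when merging sources: if the same SMS text appears in more than one dataset, keep the copy from the highest-priority source. A minimal sketch (the actual merge script is not published; source names here are illustrative shorthands):

```python
# Priority order from the list above, highest first.
SOURCES = [
    "merged_final",        # real SMS, highest priority
    "kaggle_engreemali",   # real Kaggle SMS
    "kaggle_kumarperiya",  # synthesized from structured data
    "chatgpt_synthetic",   # ChatGPT-generated
    "pattern_templates",   # programmatic templates, lowest priority
]
RANK = {s: i for i, s in enumerate(SOURCES)}

def merge(rows):
    """rows: iterable of (source, sms_text, label). Dedup by normalized
    text, keeping the highest-priority (lowest-rank) source."""
    best = {}
    for src, text, label in rows:
        key = text.strip().lower()
        if key not in best or RANK[src] < RANK[best[key][0]]:
            best[key] = (src, text, label)
    return list(best.values())
```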
## Key Design Decision: Pattern-Based Training

**Problem with company-name training:**

```text
Training: "Swiggy order Rs.450" -> Food
Training: "Zomato order Rs.340" -> Food
=> Model learns: Swiggy/Zomato = Food
=> At inference: "NewApp order Rs.280" -> FAILS (never saw NewApp)
```

**Solution: train on language patterns:**

```text
Training: "food order rs 450 confirmed. delivery in 30 mins" -> Food
Training: "order placed rs 340. out for delivery" -> Food
=> Model learns: delivery + order + confirmed = Food
=> At inference: "NewApp food order Rs.280" -> WORKS
```

The training data uses contextual language patterns so the model generalizes to any food delivery app, ride service, or payment platform, including ones that didn't exist when the model was trained.
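One way to realize this (a hypothetical augmentation step, not quoted from the actual training code) is to mask brand names before training so only contextual cues remain:

```python
import re

# Hypothetical brand-masking step: strip app names so the model must rely
# on contextual language patterns, not memorized brands.
BRANDS = r"\b(swiggy|zomato|blinkit|zepto|uber|ola|rapido)\b"

def depersonalize(sms):
    t = re.sub(BRANDS, "", sms.lower())
    return re.sub(r"\s+", " ", t).strip()

print(depersonalize("Swiggy order Rs.450 confirmed. Delivery in 30 mins"))
# order rs.450 confirmed. delivery in 30 mins
```

After masking, only the generic "order / confirmed / delivery" signals remain, which is exactly what the pattern-based examples above rely on.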
## Limitations

- Trained on Indian SMS format (INR / Rs. / UPI / NEFT / IMPS)
- May not generalize to non-Indian banking formats
- The `Income` category may overlap with `Expense` for peer-to-peer transfers
- An `Others` category (promotional SMS, OTPs, personal chats) is intentionally excluded
- App detection (Layer 3) returns `null` for unknown/new apps; this is by design
## Input Format

Works with raw Indian bank SMS text, including:

- UPI payment alerts (`"A/c XX5274 debited by Rs. 1530 via UPI"`)
- Bank debit/credit notifications (`"HDFC Bank: Rs 450 debited from a/c"`)
- App payment confirmations (`"Your food order Rs.345 confirmed"`)
- Recharge confirmations (`"Rs.239 recharged. Enjoy 2GB daily"`)
- Ride/travel confirmations (`"Trip ended. Fare Rs.234 charged"`)
## Citation

If you use this model in your work:

```bibtex
@misc{expense-tracker-distilbert-v3,
  title  = {Expense Tracker -- Indian SMS Classifier (DistilBERT + LoRA)},
  author = {udayugale},
  year   = {2025},
  url    = {https://huggingface.co/udayugale/expense-tracker-distilbert-v3}
}
```