💳 Expense Tracker – Indian SMS Classifier

A fine-tuned distilbert-base-uncased model that classifies Indian bank/UPI/payment SMS messages into expense categories. Built with LoRA (Low-Rank Adaptation) for better generalization on small datasets.


🏷️ Categories

| ID | Category | Description | Example SMS |
|----|----------|-------------|-------------|
| 0 | Bills | Utility bills, subscriptions, EMI | "Electricity bill Rs.1340 paid" |
| 1 | Expense | Generic bank debits, UPI transfers | "A/c debited by Rs.1530. Bal Rs.3303." |
| 2 | Food | Food delivery, restaurant orders | "Food order Rs.345 confirmed. Delivery 20 mins." |
| 3 | Income | Credits, salary, refunds, cashback | "Rs.45000 credited. Salary for March." |
| 4 | Recharge | Mobile/DTH recharge | "Rs.239 recharged. Validity 28 days." |
| 5 | Transport | Rides, flights, trains, toll | "Ride completed. Rs.234 charged." |
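
For programmatic use of the numeric IDs (e.g. when mapping raw logits to categories without the pipeline), the table can be kept as a plain dict. This is a convenience sketch; the authoritative mapping is the `id2label` field in the model's config:

```python
# Category mapping as documented in the table above.
# The authoritative source is the model config's id2label field.
ID2LABEL = {
    0: "Bills",
    1: "Expense",
    2: "Food",
    3: "Income",
    4: "Recharge",
    5: "Transport",
}
LABEL2ID = {v: k for k, v in ID2LABEL.items()}

print(LABEL2ID["Food"])  # 2
```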

🚀 Quick Start

```python
from transformers import pipeline

clf = pipeline("text-classification",
               model="udayugale/expense-tracker-distilbert-v3")

# Single SMS
result = clf("Food order Rs.345 confirmed. Delivery in 20 mins. Enjoy your meal!")
print(result)
# [{'label': 'Food', 'score': 0.97}]

# Batch prediction
sms_list = [
    "A/c XX5274 debited by Rs. 1530. Total Bal Rs. 3303 CR.",
    "Rs698 recharged! Enjoy Unlimited Calls. Valid 28 days.",
    "Your ride has ended. Total fare Rs.234. Thanks for riding.",
    "Rs.45000 credited to your account. Salary for March.",
]
results = clf(sms_list)
for sms, r in zip(sms_list, results):
    print(f"{r['label']:>10} ({r['score']:.2f})  {sms[:55]}")
```

🔧 Full 3-Layer Pipeline (Recommended)

The model alone gives you the category. For production use, combine it with amount extraction and app detection:

```python
import re
from transformers import pipeline

clf = pipeline("text-classification",
               model="udayugale/expense-tracker-distilbert-v3")

def clean_sms(text):
    """Same cleaning used during training; must match exactly."""
    t = text.lower()
    t = re.sub(r'https?://\S+', 'URL', t)
    t = re.sub(r'inr|rs\.?|₹', 'rs ', t)
    t = re.sub(r'a/c\s*(?:xx|\*+)?\d+', 'ACNO', t, flags=re.IGNORECASE)
    t = re.sub(r'\b\d{8,}\b', 'REFNO', t)
    t = re.sub(r'[^\x00-\x7F]+', ' ', t)
    return re.sub(r'\s+', ' ', t).strip()

APP_PATTERNS = [
    ("Swiggy",    r"\bswiggy\b"),    ("Zomato",  r"\bzomato\b"),
    ("Blinkit",   r"\bblinkit\b"),   ("Zepto",   r"\bzepto\b"),
    ("Uber",      r"\buber\b"),      ("Ola",     r"\bola\s+ride\b|olacab"),
    ("Rapido",    r"\brapido\b"),    ("IRCTC",   r"\birctc\b"),
    ("FASTag",    r"\bfastag\b"),    ("Amazon",  r"\bamazon\b"),
    ("Flipkart",  r"\bflipkart\b"),  ("Netflix", r"\bnetflix\b"),
    ("Jio",       r"\bjio\b"),       ("Airtel",  r"\bairtel\b"),
    ("PhonePe",   r"\bphonepe\b"),   ("Paytm",   r"\bpaytm\b"),
    ("HDFC",      r"\bhdfc\b"),      ("SBI",     r"\bsbi\b"),
    ("ICICI",     r"\bicici\b"),     ("Kotak",   r"\bkotak\b"),
    # Add more as new apps emerge
]

def analyze_sms(raw_text, sender=None):
    """
    Full analysis: category + amount + transaction type + app.
    Returns None for app if unknown; never forces a wrong answer.
    `sender` is currently unused (reserved for sender-ID heuristics).
    """
    # Layer 1: ML classification
    ml = clf(clean_sms(raw_text))[0]

    # Layer 2: Amount extraction
    amounts = [
        float(a.replace(",", ""))
        for a in re.findall(
            r"(?:rs\.?\s*|inr\s*|₹\s*)(\d[\d,]*(?:\.\d{1,2})?)",
            raw_text, re.IGNORECASE
        )
    ]
    txn_type = (
        "credit"   if re.search(r"\bcredited\b|\breceived\b|\bsalary\b", raw_text, re.I)
        else "debit"    if re.search(r"\bdebited\b|\bsent\b|\bpaid\b|\bcharged\b", raw_text, re.I)
        else "recharge" if re.search(r"\brecharged\b", raw_text, re.I)
        else "unknown"
    )
    bal = re.search(
        r"(?:total bal|avl bal|balance)[:\s]*(?:rs\.?\s*|₹)?([\d,]+(?:\.\d{1,2})?)",
        raw_text, re.I
    )

    # Layer 3: App detection (None if unknown; not forced)
    app = None
    for name, pat in APP_PATTERNS:
        if re.search(pat, raw_text, re.IGNORECASE):
            app = name
            break

    return {
        "category":   ml["label"],
        "confidence": round(ml["score"], 4),
        "amount":     amounts[0] if amounts else None,
        "type":       txn_type,
        "balance":    float(bal.group(1).replace(",", "")) if bal else None,
        "app":        app,   # None = unknown app; the model still classifies the category
    }

# Example
import json
result = analyze_sms(
    "A/c XX5274 credited by Rs. 350.00 via UPI from RAHUL VILAS",
    sender="AD-CENTBK-T"
)
print(json.dumps(result, indent=2))
# {
#   "category":   "Income",
#   "confidence": 0.9734,
#   "amount":     350.0,
#   "type":       "credit",
#   "balance":    null,
#   "app":        null
# }
```

📊 Training Details

Model Architecture

| Setting | Value |
|---------|-------|
| Base model | distilbert-base-uncased |
| Method | LoRA (Low-Rank Adaptation), not QLoRA |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| LoRA target layers | q_lin, k_lin, v_lin, out_lin |
| Trainable parameters | ~1.2M (1.8% of 66M total) |
| Frozen parameters | ~64.8M (base DistilBERT) |
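
The LoRA settings in the table map onto a peft `LoraConfig` roughly as follows. This is a sketch assuming the standard peft API; the actual training code isn't published here:

```python
from peft import LoraConfig, TaskType

# LoRA hyperparameters from the table above (assumed, not the verbatim training script).
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,   # sequence-classification head stays trainable in fp32
    r=16,                         # LoRA rank
    lora_alpha=32,
    lora_dropout=0.1,
    # DistilBERT's attention projection layers
    target_modules=["q_lin", "k_lin", "v_lin", "out_lin"],
)
```

Wrapping the base model with `get_peft_model(model, lora_cfg)` would then freeze the ~64.8M base parameters and train only the adapters.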

Training Configuration

| Setting | Value |
|---------|-------|
| Epochs | 12 |
| Learning rate | 3e-4 (higher than standard; correct for LoRA) |
| Batch size | 32 |
| LR scheduler | Cosine decay |
| Warmup ratio | 0.06 |
| Weight decay | 0.01 |
| Loss function | Weighted CrossEntropy (minority classes weighted higher) |
| Max sequence length | 128 tokens |
| Optimizer | AdamW |
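
The weighted CrossEntropy row implies per-class weights. The exact scheme isn't stated on this card, but a common choice is inverse-frequency weights normalized so a balanced dataset yields 1.0 per class:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights; a balanced dataset gives 1.0 for every class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Minority class "Recharge" gets a higher weight than majority "Expense".
w = class_weights(["Expense"] * 300 + ["Recharge"] * 100)
print(w)  # Expense ≈ 0.67, Recharge = 2.0
```

These weights would then be passed to the loss, e.g. `torch.nn.CrossEntropyLoss(weight=...)` in class-ID order.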

Why LoRA (not QLoRA)

QLoRA requires bitsandbytes 4-bit quantization, which is incompatible with DistilBERT's encoder architecture (dtype conflicts between the uint8 quantized base and the fp32 classification head). LoRA gives the same accuracy improvement; the gain comes from training fewer parameters, not from quantization.


📦 Training Data

| Source | Type | Rows Used |
|--------|------|-----------|
| merged_final_dataset.csv | Real Indian SMS from 93 users | ~4,200 |
| engreemali/bank-transactions-sms-datasetss (Kaggle) | Real Indian SMS (100K total) | ~1,200 |
| kumarperiya/pan-indian-consumer-transaction-dataset (Kaggle) | Structured data → synthesized SMS | ~600 |
| realistic_synthetic_sms.csv (ChatGPT-generated) | Synthetic SMS | ~3,200 |
| Pattern templates (programmatic) | Language-pattern augmentation | ~1,400 |

Data Priority (highest to lowest)

  1. Real SMS from merged_final_dataset.csv
  2. Real SMS from Kaggle engreemali dataset
  3. Synthesized from Kaggle kumarperiya dataset
  4. ChatGPT synthetic data (fixed: dropped Income/Others, removed bad sender column)
  5. Programmatic pattern templates
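
One plausible way to apply this priority order when sources overlap is to deduplicate top-down, keeping the first (highest-priority) copy of each message. A sketch, not the project's actual merge code:

```python
def merge_by_priority(sources):
    """sources: lists of (text, label) pairs, ordered highest to lowest priority."""
    seen, merged = set(), []
    for rows in sources:
        for text, label in rows:
            key = " ".join(text.lower().split())  # normalize case/whitespace for dedup
            if key not in seen:
                seen.add(key)
                merged.append((text, label))
    return merged

real = [("Rs.45000 credited. Salary for March.", "Income")]
synthetic = [("Rs.45000 credited. Salary for March.", "Income"),
             ("Rs.239 recharged. Validity 28 days.", "Recharge")]
print(merge_by_priority([real, synthetic]))  # duplicate kept from the real source
```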

🔑 Key Design Decision: Pattern-Based Training

Problem with company-name training:

```
Training: "Swiggy order Rs.450"  → Food
Training: "Zomato order Rs.340"  → Food
→ Model learns: Swiggy/Zomato = Food
→ At inference: "NewApp order Rs.280" → FAILS (never saw NewApp)
```

Solution: train on language patterns:

```
Training: "food order rs 450 confirmed. delivery in 30 mins" → Food
Training: "order placed rs 340. out for delivery"            → Food
→ Model learns: delivery + order + confirmed = Food
→ At inference: "NewApp food order Rs.280" → WORKS ✅
```

The training data uses contextual language patterns so the model generalizes to any food delivery app, ride service, or payment platform, including ones that didn't exist when the model was trained.
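
One simple augmentation that enforces this is masking known brand names with a generic token before training, so the model cannot memorize them. A sketch under that assumption (the card doesn't specify the exact augmentation used; the brand list here is illustrative):

```python
import re

# Hypothetical brand list; extend with whatever apps appear in the training data.
BRANDS = ["swiggy", "zomato", "uber", "ola", "rapido", "blinkit"]
BRAND_RE = re.compile(r"\b(" + "|".join(BRANDS) + r")\b", re.IGNORECASE)

def mask_brands(text):
    """Replace known brand names with a generic APP token."""
    return BRAND_RE.sub("APP", text)

print(mask_brands("Swiggy order Rs.450 confirmed. Delivery in 30 mins"))
# APP order Rs.450 confirmed. Delivery in 30 mins
```

Unknown apps pass through untouched, which is exactly the inference-time situation the pattern-based approach is designed for.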


⚠️ Limitations

  • Trained on Indian SMS format (INR / Rs. / UPI / NEFT / IMPS)
  • May not generalize to non-Indian banking formats
  • Income category may overlap with Expense for peer-to-peer transfers
  • Others category (promotional SMS, OTPs, personal chats) is intentionally excluded
  • App detection (Layer 3) returns null for unknown/new apps β€” this is by design

📋 Input Format

Works with raw Indian bank SMS text including:

  • UPI payment alerts ("A/c XX5274 debited by Rs. 1530 via UPI")
  • Bank debit/credit notifications ("HDFC Bank: Rs 450 debited from a/c")
  • App payment confirmations ("Your food order Rs.345 confirmed")
  • Recharge confirmations ("Rs.239 recharged. Enjoy 2GB daily")
  • Ride/travel confirmations ("Trip ended. Fare Rs.234 charged")

🤝 Citation

If you use this model in your work:

```bibtex
@misc{expense-tracker-distilbert-v3,
  title  = {Expense Tracker – Indian SMS Classifier (DistilBERT + LoRA)},
  author = {your_name},
  year   = {2025},
  url    = {https://huggingface.co/your_username/expense-tracker-distilbert-v3}
}
```