💳 Expense Tracker – Indian SMS Classifier

A fine-tuned distilbert-base-uncased model that classifies Indian bank/UPI/payment SMS messages into expense categories. Built with LoRA (Low-Rank Adaptation) for better generalization on small datasets.


🏷️ Categories

| ID | Category | Description | Example SMS |
|----|----------|-------------|-------------|
| 0 | Bills | Utility bills, subscriptions, EMI | "Electricity bill Rs.1340 paid" |
| 1 | Expense | Generic bank debits, UPI transfers | "A/c debited by Rs.1530. Bal Rs.3303." |
| 2 | Food | Food delivery, restaurant orders | "Food order Rs.345 confirmed. Delivery 20 mins." |
| 3 | Income | Credits, salary, refunds, cashback | "Rs.45000 credited. Salary for March." |
| 4 | Recharge | Mobile/DTH recharge | "Rs.239 recharged. Validity 28 days." |
| 5 | Transport | Rides, flights, trains, toll | "Ride completed. Rs.234 charged." |
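
For programmatic use of the numeric IDs (e.g. when mapping raw logits to categories without the pipeline), the table can be kept as a plain dict. This is a convenience sketch; the authoritative mapping is the `id2label` field in the model's config:

```python
# Category mapping as documented in the table above.
# The authoritative source is the model config's id2label field.
ID2LABEL = {
    0: "Bills",
    1: "Expense",
    2: "Food",
    3: "Income",
    4: "Recharge",
    5: "Transport",
}
LABEL2ID = {v: k for k, v in ID2LABEL.items()}

print(LABEL2ID["Food"])  # 2
```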

🚀 Quick Start

```python
from transformers import pipeline

clf = pipeline("text-classification",
               model="udayugale/expense-tracker-distilbert-v3")

# Single SMS
result = clf("Food order Rs.345 confirmed. Delivery in 20 mins. Enjoy your meal!")
print(result)
# [{'label': 'Food', 'score': 0.97}]

# Batch prediction
sms_list = [
    "A/c XX5274 debited by Rs. 1530. Total Bal Rs. 3303 CR.",
    "Rs698 recharged! Enjoy Unlimited Calls. Valid 28 days.",
    "Your ride has ended. Total fare Rs.234. Thanks for riding.",
    "Rs.45000 credited to your account. Salary for March.",
]
results = clf(sms_list)
for sms, r in zip(sms_list, results):
    print(f"{r['label']:>10} ({r['score']:.2f})  {sms[:55]}")
```

🔧 Full 3-Layer Pipeline (Recommended)

The model alone gives you the category. For production use, combine it with amount extraction and app detection:

```python
import re
from transformers import pipeline

clf = pipeline("text-classification",
               model="udayugale/expense-tracker-distilbert-v3")

def clean_sms(text):
    """Same cleaning used during training; must match exactly."""
    t = text.lower()
    t = re.sub(r'https?://\S+', 'URL', t)
    t = re.sub(r'inr|rs\.?|₹', 'rs ', t)
    t = re.sub(r'a/c\s*(?:xx|\*+)?\d+', 'ACNO', t, flags=re.IGNORECASE)
    t = re.sub(r'\b\d{8,}\b', 'REFNO', t)
    t = re.sub(r'[^\x00-\x7F]+', ' ', t)
    return re.sub(r'\s+', ' ', t).strip()

APP_PATTERNS = [
    ("Swiggy",    r"\bswiggy\b"),    ("Zomato",  r"\bzomato\b"),
    ("Blinkit",   r"\bblinkit\b"),   ("Zepto",   r"\bzepto\b"),
    ("Uber",      r"\buber\b"),      ("Ola",     r"\bola\s+ride\b|olacab"),
    ("Rapido",    r"\brapido\b"),    ("IRCTC",   r"\birctc\b"),
    ("FASTag",    r"\bfastag\b"),    ("Amazon",  r"\bamazon\b"),
    ("Flipkart",  r"\bflipkart\b"),  ("Netflix", r"\bnetflix\b"),
    ("Jio",       r"\bjio\b"),       ("Airtel",  r"\bairtel\b"),
    ("PhonePe",   r"\bphonepe\b"),   ("Paytm",   r"\bpaytm\b"),
    ("HDFC",      r"\bhdfc\b"),      ("SBI",     r"\bsbi\b"),
    ("ICICI",     r"\bicici\b"),     ("Kotak",   r"\bkotak\b"),
    # Add more as new apps emerge
]

def analyze_sms(raw_text, sender=None):
    """
    Full analysis: category + amount + transaction type + app.
    Returns None for app if unknown; never forces a wrong answer.
    `sender` is currently unused (reserved for sender-ID heuristics).
    """
    # Layer 1: ML classification
    ml = clf(clean_sms(raw_text))[0]

    # Layer 2: Amount extraction
    amounts = [
        float(a.replace(",", ""))
        for a in re.findall(
            r"(?:rs\.?\s*|inr\s*|₹\s*)(\d[\d,]*(?:\.\d{1,2})?)",
            raw_text, re.IGNORECASE
        )
    ]
    txn_type = (
        "credit"   if re.search(r"\bcredited\b|\breceived\b|\bsalary\b", raw_text, re.I)
        else "debit"    if re.search(r"\bdebited\b|\bsent\b|\bpaid\b|\bcharged\b", raw_text, re.I)
        else "recharge" if re.search(r"\brecharged\b", raw_text, re.I)
        else "unknown"
    )
    bal = re.search(
        r"(?:total bal|avl bal|balance)[:\s]*(?:rs\.?\s*|₹)?([\d,]+(?:\.\d{1,2})?)",
        raw_text, re.I
    )

    # Layer 3: App detection (None if unknown; not forced)
    app = None
    for name, pat in APP_PATTERNS:
        if re.search(pat, raw_text, re.IGNORECASE):
            app = name
            break

    return {
        "category":   ml["label"],
        "confidence": round(ml["score"], 4),
        "amount":     amounts[0] if amounts else None,
        "type":       txn_type,
        "balance":    float(bal.group(1).replace(",", "")) if bal else None,
        "app":        app,   # None = unknown app; the model still classifies the category
    }

# Example
import json
result = analyze_sms(
    "A/c XX5274 credited by Rs. 350.00 via UPI from RAHUL VILAS",
    sender="AD-CENTBK-T"
)
print(json.dumps(result, indent=2))
# {
#   "category":   "Income",
#   "confidence": 0.9734,
#   "amount":     350.0,
#   "type":       "credit",
#   "balance":    null,
#   "app":        null
# }
```

📊 Training Details

Model Architecture

| Setting | Value |
|---------|-------|
| Base model | distilbert-base-uncased |
| Method | LoRA (Low-Rank Adaptation), not QLoRA |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| LoRA target layers | q_lin, k_lin, v_lin, out_lin |
| Trainable parameters | ~1.2M (1.8% of 66M total) |
| Frozen parameters | ~64.8M (base DistilBERT) |
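
The LoRA settings in the table map onto a peft `LoraConfig` roughly as follows. This is a sketch assuming the standard peft API; the actual training code isn't published here:

```python
from peft import LoraConfig, TaskType

# LoRA hyperparameters from the table above (assumed, not the verbatim training script).
lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_CLS,   # sequence-classification head stays trainable in fp32
    r=16,                         # LoRA rank
    lora_alpha=32,
    lora_dropout=0.1,
    # DistilBERT's attention projection layers
    target_modules=["q_lin", "k_lin", "v_lin", "out_lin"],
)
```

Wrapping the base model with `get_peft_model(model, lora_cfg)` would then freeze the ~64.8M base parameters and train only the adapters.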

Training Configuration

| Setting | Value |
|---------|-------|
| Epochs | 12 |
| Learning rate | 3e-4 (higher than standard; correct for LoRA) |
| Batch size | 32 |
| LR scheduler | Cosine decay |
| Warmup ratio | 0.06 |
| Weight decay | 0.01 |
| Loss function | Weighted CrossEntropy (minority classes weighted higher) |
| Max sequence length | 128 tokens |
| Optimizer | AdamW |
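
The weighted CrossEntropy row implies per-class weights. The exact scheme isn't stated on this card, but a common choice is inverse-frequency weights normalized so a balanced dataset yields 1.0 per class:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights; a balanced dataset gives 1.0 for every class."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Minority class "Recharge" gets a higher weight than majority "Expense".
w = class_weights(["Expense"] * 300 + ["Recharge"] * 100)
print(w)  # Expense ≈ 0.67, Recharge = 2.0
```

These weights would then be passed to the loss, e.g. `torch.nn.CrossEntropyLoss(weight=...)` in class-ID order.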

Why LoRA (not QLoRA)

QLoRA requires bitsandbytes 4-bit quantization, which is incompatible with DistilBERT's encoder architecture (dtype conflicts between the uint8 quantized base and the fp32 classification head). LoRA gives the same accuracy improvement; the gain comes from training fewer parameters, not from quantization.


📦 Training Data

| Source | Type | Rows Used |
|--------|------|-----------|
| merged_final_dataset.csv | Real Indian SMS from 93 users | ~4,200 |
| engreemali/bank-transactions-sms-datasetss (Kaggle) | Real Indian SMS (100K total) | ~1,200 |
| kumarperiya/pan-indian-consumer-transaction-dataset (Kaggle) | Structured data → synthesized SMS | ~600 |
| realistic_synthetic_sms.csv (ChatGPT-generated) | Synthetic SMS | ~3,200 |
| Pattern templates (programmatic) | Language-pattern augmentation | ~1,400 |

Data Priority (highest to lowest)

  1. Real SMS from merged_final_dataset.csv
  2. Real SMS from Kaggle engreemali dataset
  3. Synthesized from Kaggle kumarperiya dataset
  4. ChatGPT synthetic data (fixed: dropped Income/Others, removed bad sender column)
  5. Programmatic pattern templates
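
One plausible way to apply this priority order when sources overlap is to deduplicate top-down, keeping the first (highest-priority) copy of each message. A sketch, not the project's actual merge code:

```python
def merge_by_priority(sources):
    """sources: lists of (text, label) pairs, ordered highest to lowest priority."""
    seen, merged = set(), []
    for rows in sources:
        for text, label in rows:
            key = " ".join(text.lower().split())  # normalize case/whitespace for dedup
            if key not in seen:
                seen.add(key)
                merged.append((text, label))
    return merged

real = [("Rs.45000 credited. Salary for March.", "Income")]
synthetic = [("Rs.45000 credited. Salary for March.", "Income"),
             ("Rs.239 recharged. Validity 28 days.", "Recharge")]
print(merge_by_priority([real, synthetic]))  # duplicate kept from the real source
```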

🔑 Key Design Decision: Pattern-Based Training

Problem with company-name training:

```
Training: "Swiggy order Rs.450"  → Food
Training: "Zomato order Rs.340"  → Food
→ Model learns: Swiggy/Zomato = Food
→ At inference: "NewApp order Rs.280" → FAILS (never saw NewApp)
```

Solution: train on language patterns:

```
Training: "food order rs 450 confirmed. delivery in 30 mins" → Food
Training: "order placed rs 340. out for delivery"            → Food
→ Model learns: delivery + order + confirmed = Food
→ At inference: "NewApp food order Rs.280" → WORKS ✅
```

The training data uses contextual language patterns so the model generalizes to any food delivery app, ride service, or payment platform, including ones that didn't exist when the model was trained.
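
One simple augmentation that enforces this is masking known brand names with a generic token before training, so the model cannot memorize them. A sketch under that assumption (the card doesn't specify the exact augmentation used; the brand list here is illustrative):

```python
import re

# Hypothetical brand list; extend with whatever apps appear in the training data.
BRANDS = ["swiggy", "zomato", "uber", "ola", "rapido", "blinkit"]
BRAND_RE = re.compile(r"\b(" + "|".join(BRANDS) + r")\b", re.IGNORECASE)

def mask_brands(text):
    """Replace known brand names with a generic APP token."""
    return BRAND_RE.sub("APP", text)

print(mask_brands("Swiggy order Rs.450 confirmed. Delivery in 30 mins"))
# APP order Rs.450 confirmed. Delivery in 30 mins
```

Unknown apps pass through untouched, which is exactly the inference-time situation the pattern-based approach is designed for.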


⚠️ Limitations

  • Trained on Indian SMS format (INR / Rs. / UPI / NEFT / IMPS)
  • May not generalize to non-Indian banking formats
  • Income category may overlap with Expense for peer-to-peer transfers
  • Others category (promotional SMS, OTPs, personal chats) is intentionally excluded
  • App detection (Layer 3) returns null for unknown/new apps β€” this is by design

📋 Input Format

Works with raw Indian bank SMS text including:

  • UPI payment alerts ("A/c XX5274 debited by Rs. 1530 via UPI")
  • Bank debit/credit notifications ("HDFC Bank: Rs 450 debited from a/c")
  • App payment confirmations ("Your food order Rs.345 confirmed")
  • Recharge confirmations ("Rs.239 recharged. Enjoy 2GB daily")
  • Ride/travel confirmations ("Trip ended. Fare Rs.234 charged")

🤝 Citation

If you use this model in your work:

```bibtex
@misc{expense-tracker-distilbert-v3,
  title  = {Expense Tracker – Indian SMS Classifier (DistilBERT + LoRA)},
  author = {your_name},
  year   = {2025},
  url    = {https://huggingface.co/your_username/expense-tracker-distilbert-v3}
}
```