💳 Expense Tracker - DistilBERT LoRA v2 (4 Data Sources)
📦 Training Data

| Source | Type | Rows |
|---|---|---|
| engreemali/bank-transactions-sms-datasetss | Real Indian SMS (cleaned) | ~1,200 |
| kumarperiya/pan-indian-consumer-transaction-dataset | Structured → synthetic SMS | ~600 |
| ChatGPT synthetic_sms_5000 (fixed) | Synthetic (augmented) | ~3,300 |
| ChatGPT realistic_synthetic_sms (fixed) | Synthetic (realistic) | ~3,200 |

🏷️ Categories

| ID | Category |
|---|---|
| 0 | Education |
| 1 | Entertainment |
| 2 | Food |
| 3 | Healthcare |
| 4 | Shopping |
| 5 | Transport |
| 6 | Utilities |

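
The ID-to-category table can be expressed as a lookup for decoding predictions. This is a minimal sketch: the names `ID2LABEL`, `LABEL2ID`, and `to_category` are illustrative, and it assumes (the card does not say) that predictions may arrive either as category names or in the generic `LABEL_n` form that `transformers` uses when a model config carries no label names.

```python
# Category IDs as listed in the table above.
ID2LABEL = {
    0: "Education",
    1: "Entertainment",
    2: "Food",
    3: "Healthcare",
    4: "Shopping",
    5: "Transport",
    6: "Utilities",
}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def to_category(label):
    """Return the category name for an int ID, a 'LABEL_n' string, or a name."""
    if isinstance(label, int):
        return ID2LABEL[label]
    if label.startswith("LABEL_"):
        # Generic label form: take the integer after the underscore.
        return ID2LABEL[int(label.split("_", 1)[1])]
    if label in LABEL2ID:
        # Already a category name.
        return label
    raise ValueError(f"Unknown label: {label!r}")
```

For example, `to_category("LABEL_3")` and `to_category(3)` both resolve to `Healthcare`.
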
Usage

```python
from transformers import pipeline

# Load the classifier from the Hugging Face Hub
clf = pipeline('text-classification', model='udayugale/expense-tracker-distilbert-lora-v2')

# Classify a sample transaction SMS
print(clf('Netmeds medicine order rs 350 confirmed. Delivery in 2 hrs'))
```

🔧 Fixes Applied to ChatGPT Data

- Dropped `Income` and `Others` labels (not in the expense categories)
- Mapped `Bills` → `Utilities`
- Dropped the `sender` column from File 2 (2,376 sender-label mismatches)
- Augmented short texts (< 7 words) with bank SMS context wrappers
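
The fixes listed above could be sketched roughly as follows. This is not the original preprocessing script: the row keys `text`, `label`, and `sender`, the function name `clean_rows`, and the two wrapper templates are all assumptions made for illustration, since the card does not give the actual column names or wrapper wording.

```python
import random

DROP_LABELS = {"Income", "Others"}   # not expense categories
LABEL_MAP = {"Bills": "Utilities"}   # merge Bills into Utilities
MIN_WORDS = 7                        # shorter texts get a context wrapper

# Hypothetical wrapper templates; the real ones are not given in this card.
WRAPPERS = [
    "Rs {amt} debited from A/c XX1234 for {text}. Avl bal Rs 5000.",
    "Your card ending 4321 was used for {text}. Amount Rs {amt}.",
]


def clean_rows(rows, seed=0):
    """Apply the four fixes to a list of {'text', 'label', 'sender'} dicts."""
    rng = random.Random(seed)  # seeded so the augmentation is repeatable
    cleaned = []
    for row in rows:
        label = LABEL_MAP.get(row["label"], row["label"])
        if label in DROP_LABELS:
            continue  # drop Income / Others rows
        text = row["text"].strip()
        if len(text.split()) < MIN_WORDS:
            # Wrap short texts in bank-SMS style context.
            template = rng.choice(WRAPPERS)
            text = template.format(text=text, amt=rng.randint(50, 5000))
        # The sender field is intentionally not copied to the output.
        cleaned.append({"text": text, "label": label})
    return cleaned
```

A row such as `{"text": "Electricity bill", "label": "Bills", "sender": "X"}` would come out relabeled as `Utilities`, wrapped in one of the templates, and without its `sender` field.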