---
license: apache-2.0
tags:
- text-classification
- transformers
- onnx
- safetensors
- transformers.js
- distilbert
- finance
- transactions
- english
language:
- en
datasets:
- DoDataThings/us-bank-transaction-categories-v2
pipeline_tag: text-classification
---
# DistilBERT US Bank Transaction Classifier v2
A fine-tuned DistilBERT model that classifies US bank transaction descriptions into 17 spending categories. Each input carries a `[debit]` or `[credit]` sign prefix to disambiguate transaction direction: a Venmo cash-out and a Venmo payment read alike in the description but mean opposite things financially.
**Successor to [v1](https://huggingface.co/DoDataThings/distilbert-us-transaction-classifier)**, which classified on description text alone. v2 adds sign-aware input, expanded merchant coverage (500+), multi-format training across 8 bank statement structures, and PayPal as a first-class format.
## How It Works
The model takes a sign prefix + transaction description and outputs one of 17 categories:
```
Input: "[debit] STARBUCKS #1234 SAN FRANCISCO CA"
Output: Restaurants (0.99)
Input: "[credit] ACME CORP PAYROLL PPD ID: 123456789"
Output: Income (1.00)
Input: "[debit] CHASE CREDIT CRD AUTOPAY PPD ID: 9876543210"
Output: Transfer (1.00)
Input: "[debit] PreApproved Payment Bill User Payment: Netflix"
Output: Subscription (1.00)
```
The sign prefix encodes the transaction direction from the cardholder's perspective:
- `[debit]` — money left the account (purchases, payments out, fees)
- `[credit]` — money entered the account (income, refunds, payments received)
This is critical for distinguishing Income from Transfer. `[credit] VENMO CASHOUT` is Income (money arriving). `[debit] VENMO PAYMENT TO JOHN SMITH` is Transfer (money leaving). The description alone can't tell you which.
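A minimal sketch of attaching the sign prefix before calling the model. It assumes a convention this card does not specify: that your export stores outflows as negative amounts. `build_model_input` is a hypothetical helper, not part of the released code.

```python
def build_model_input(description: str, amount: float) -> str:
    """Prefix a raw description with the sign token the model expects.

    Assumption: amount < 0 means money left the account (debit).
    Flip the comparison if your bank export uses the opposite convention.
    A zero amount is treated as a credit here, which is an arbitrary choice.
    """
    prefix = "[debit]" if amount < 0 else "[credit]"
    # Only outer whitespace is trimmed; inner spacing is left untouched.
    return f"{prefix} {description.strip()}"


print(build_model_input("STARBUCKS #1234 SAN FRANCISCO CA", -5.75))
print(build_model_input("ACME CORP PAYROLL PPD ID: 123456789", 2500.00))
```

Some exports carry a separate debit/credit column instead of a signed amount; in that case the prefix can be mapped from that column directly.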
## Categories (17)
| Category | What it covers |
|----------|----------------|
| Restaurants | Fast food, sit-down, coffee, delivery, POS systems (TST*, SQ*, CLV*) |
| Groceries | Supermarkets, warehouse clubs, farmers markets, convenience stores |
| Shopping | Retail, online, department stores, pet stores, liquor stores, e-commerce marketplaces |
| Transportation | Gas, EV charging, rideshare, auto service, parking, tolls, DMV |
| Entertainment | Movies, events, gaming, gambling/sportsbooks |
| Utilities | Electric, internet, phone, water, waste/trash, solar |
| Subscription | Streaming, SaaS, AI tools, VPNs, social media premium, dating, business SaaS |
| Healthcare | Pharmacy, doctor, dentist, telehealth, vision, hospital |
| Insurance | Auto, home, health, life, home warranty |
| Mortgage | Bank, credit union, and fintech mortgage payments, escrow, principal |
| Rent | Property management companies, lease payments |
| Travel | Hotels, airlines, car rental, cruise lines, airport services |
| Education | Online courses, tutoring, books, tuition, certification |
| Personal Care | Salon, gym, beauty, spa, barber |
| Transfer | CC autopay, P2P sends, bank transfers, brokerage sweeps, fintech, BNPL, wire, ATM, cashier's checks |
| Income | Payroll, direct deposit, interest, refunds, government benefits, gig economy payouts |
| Fees | Bank fees, late fees, ATM surcharges, service charges |
### Account-Type-Implied Categories
If you know the account type, some categories can be assigned without the model:
| Account Type | Category |
|---|---|
| Mortgage | Mortgage |
| Auto Loan | Transportation |
| Student Loan | Education |
| Personal Loan | Transfer |
| HELOC | Transfer |
| CD | Income |
For checking, savings, and credit card accounts, use the model.
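The routing in the table above could be sketched as a plain lookup that runs before the model. The function name and the lowercase key normalization are illustrative assumptions; only the mapping itself comes from the table.

```python
from typing import Optional

# Mapping copied from the table above; keys are lowercased for lookup.
ACCOUNT_TYPE_CATEGORY = {
    "mortgage": "Mortgage",
    "auto loan": "Transportation",
    "student loan": "Education",
    "personal loan": "Transfer",
    "heloc": "Transfer",
    "cd": "Income",
}


def category_from_account_type(account_type: str) -> Optional[str]:
    """Return the implied category, or None when the model should decide.

    None is returned for checking, savings, credit card, and any
    account type that is not listed in the table.
    """
    return ACCOUNT_TYPE_CATEGORY.get(account_type.strip().lower())


print(category_from_account_type("HELOC"))     # Transfer
print(category_from_account_type("checking"))  # None
```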
## Training
```
Model: DistilBERT-base-uncased + LoRA (r=32, alpha=64)
Dataset: 68,000 synthetic samples (4,000 per category)
Trainable: 1.8M / 68.7M parameters (2.6%)
Training: 20 epochs, best at epoch 16
Validation: 99.9% accuracy (15 of 17 categories at 100%)
```
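For readers who want to sanity-check the trainable-parameter figure, here is a rough sketch of how a LoRA parameter count is computed. The card does not say which layers carry adapters or which head layers are trained, so the placement below (all four attention projections, plus both classification-head layers) is purely an assumption; it is one configuration that happens to land close to the stated 1.8M.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two low-rank matrices per adapted layer: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


HIDDEN = 768      # DistilBERT-base hidden size
LAYERS = 6        # DistilBERT-base transformer layers
NUM_LABELS = 17   # categories in this model
R = 32            # LoRA rank from the training summary above

# Assumption: adapters on all four attention projections in every layer.
ADAPTED_PER_LAYER = 4
adapter_params = LAYERS * ADAPTED_PER_LAYER * lora_param_count(HIDDEN, HIDDEN, R)

# Assumption: both head layers (768->768 and 768->17, with biases) are trained.
head_params = (HIDDEN * HIDDEN + HIDDEN) + (HIDDEN * NUM_LABELS + NUM_LABELS)

trainable = adapter_params + head_params
print(adapter_params, head_params, trainable)  # 1179648 603665 1783313
```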
### Multi-Format Training
The model is trained on 8 bank statement formats so it classifies correctly regardless of which bank produced the description:
| Format | Example | Source |
|---|---|---|
| Chase merchant | `STARBUCKS #1234` | Chase credit cards |
| Chase ACH | `INSTITUTION PURPOSE PPD ID: CODE` | Chase checking |
| Apple Card | `MERCHANT ADDRESS CITY ZIP STATE USA` | Apple Card |
| PayPal native | `PreApproved Payment Bill User Payment: MERCHANT` | PayPal credit card |
| PayPal prefix | `PP*MERCHANT`, `PYPL*MERCHANT`, `PAYPAL *MERCHANT` | Chase/other banks |
| Capital One | `Withdrawal from MERCHANT`, `Preauthorized Deposit from MERCHANT` | Capital One |
| Mercury | `MERCHANT; Description` or just `MERCHANT` | Mercury, neobanks |
| POS prefix | `SQ *MERCHANT`, `TST*MERCHANT`, `CLV*MERCHANT` | Square, Toast, Clover |
PayPal formats appear across all spending categories at meaningful rates, reflecting that people use PayPal cards at any merchant.
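The model is meant to receive the description as the bank printed it, so no preprocessing is needed for classification. If you also maintain merchant rules, though, it can help to peel off the processor prefixes listed in the table to recover the merchant name. The sketch below does only that; the function name and the processor labels are hypothetical, and the patterns cover just the prefixes shown above.

```python
import re
from typing import Optional, Tuple

# Prefixes taken from the format table; the processor labels are our own naming.
_PREFIXES = [
    (re.compile(r"^(?:PP\*|PYPL\*|PAYPAL \*)\s*", re.IGNORECASE), "paypal"),
    (re.compile(r"^SQ \*\s*", re.IGNORECASE), "square"),
    (re.compile(r"^TST\*\s*", re.IGNORECASE), "toast"),
    (re.compile(r"^CLV\*\s*", re.IGNORECASE), "clover"),
]


def split_processor_prefix(description: str) -> Tuple[Optional[str], str]:
    """Return (processor, merchant_text); processor is None when no prefix matches."""
    text = description.strip()
    for pattern, processor in _PREFIXES:
        match = pattern.match(text)
        if match:
            # Drop the matched prefix and keep the rest unchanged.
            return processor, text[match.end():]
    return None, text


print(split_processor_prefix("PP*SAFEWAY"))       # ('paypal', 'SAFEWAY')
print(split_processor_prefix("STARBUCKS #1234"))  # (None, 'STARBUCKS #1234')
```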
### Honest Assessment
The 99.9% validation accuracy is on synthetic data. On ~2,000 real transactions:
- **96.1% of model classifications at 0.90+ confidence**
- **< 0.5% below 0.50 confidence**
- 17 bank-category fallbacks (obscure merchants where the model defers)
- Shopping is the weakest category due to overlap with Subscription and Groceries
- Niche/unknown merchants may classify with lower confidence — use merchant rules for known edge cases
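One way to act on these numbers is to accept model output only above a confidence threshold and otherwise fall back to merchant rules or a bank-provided category. The sketch below is an assumption about how such a wrapper might look, not the author's pipeline: `classify_with_fallback`, the substring-based rule format, and the 0.90 cutoff are all illustrative. The `classifier` argument is any callable that returns the same list-of-dicts shape as the `transformers` text-classification pipeline.

```python
from typing import Callable, Dict, List, Optional

Classifier = Callable[[str], List[Dict[str, object]]]


def classify_with_fallback(
    text: str,
    classifier: Classifier,
    merchant_rules: Dict[str, str],
    threshold: float = 0.90,          # assumed cutoff, tune on your own data
    fallback_category: Optional[str] = None,
) -> Dict[str, object]:
    """Apply merchant rules first, then the model, then a fallback category."""
    upper = text.upper()
    # Hypothetical rule format: substring -> category, matched case-insensitively.
    for needle, category in merchant_rules.items():
        if needle.upper() in upper:
            return {"category": category, "source": "rule", "score": None}

    top = classifier(text)[0]
    if top["score"] >= threshold:
        return {"category": top["label"], "source": "model", "score": top["score"]}

    # Low confidence: prefer the caller's fallback (for example the bank's own
    # category) and keep the model's label only when no fallback is available.
    return {
        "category": fallback_category or top["label"],
        "source": "fallback",
        "score": top["score"],
    }
```

With the real model, `classifier` would be the `pipeline(...)` object; a stub function returning a fixed label and score is enough to exercise the routing logic in isolation.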
## Usage
### Python
```python
from transformers import pipeline
classifier = pipeline("text-classification", model="DoDataThings/distilbert-us-transaction-classifier-v2")
# Sign prefix required
result = classifier("[debit] STARBUCKS #1234 SAN FRANCISCO CA")
print(result) # [{'label': 'Restaurants', 'score': 0.99}]
# Sign matters for ambiguous transactions
classifier("[credit] VENMO CASHOUT PPD ID: 12345678")
# [{'label': 'Income', 'score': 0.95}]
classifier("[debit] VENMO PAYMENT TO JOHN SMITH")
# [{'label': 'Transfer', 'score': 0.97}]
# Works across all bank formats
classifier("[debit] PreApproved Payment Bill User Payment: Netflix")
# [{'label': 'Subscription', 'score': 1.00}]
classifier("[debit] PP*SAFEWAY")
# [{'label': 'Groceries', 'score': 1.00}]
```
### JavaScript (Transformers.js)
```javascript
// Top-level await needs an ES module, so use `import` rather than `require`
import { pipeline } from '@xenova/transformers';

const classifier = await pipeline(
  'text-classification',
  'DoDataThings/distilbert-us-transaction-classifier-v2'
);

// Sign prefix required, same as in Python
const result = await classifier('[debit] STARBUCKS #1234');
// [{ label: 'Restaurants', score: 0.99 }]
```
An ONNX export is included in the `onnx/` subdirectory.
## Design Decisions
- **Sign prefix, not account type.** We considered passing account type (checking, credit, etc.) as a feature but concluded that sign alone provides the disambiguation signal. Account type is an upstream routing concern — it determines which classifier runs, not what the classifier outputs.
- **17 model categories + 6 account-type categories.** Mortgage is both a model category (for classifying mortgage descriptions on checking accounts) and an account-type-implied category (for mortgage account transactions). This serves both use cases — people with account type metadata and people with just transaction descriptions.
- **PayPal as a bank format, not a wrapper.** PayPal is a card issuer. People use PayPal cards at restaurants, grocery stores, and everywhere else. The training data treats PayPal formats as first-class bank statement structures across all categories.
- **Synthetic data with real formats.** The training data is synthetic but models real bank statement patterns — Chase ACH padding, Apple Card address formats, Capital One action prefixes, Mercury's minimal format. The generator is open source so you can extend it.
## Training Data
The dataset is published at [`DoDataThings/us-bank-transaction-categories-v2`](https://huggingface.co/datasets/DoDataThings/us-bank-transaction-categories-v2).
## Generator
The synthetic data generator is open source:
```bash
node scripts/generate-training-data.js --count 4000 # 4,000 per category
```
Available at [github.com/wnstnb/foliome](https://github.com/wnstnb/foliome).
## Limitations
- **US bank formats only** — Trained on Chase, Apple Card, PayPal, Capital One, Mercury, and US Bank patterns
- **Synthetic training data** — May miss patterns from banks not represented
- **Shopping is the weakest category** due to overlap with Subscription and Groceries
- **Sign prefix required** — Passing raw descriptions without `[debit]`/`[credit]` will degrade accuracy
- **Not a standalone solution** — Best results come from combining with merchant rules and account-type classification
## License
Apache 2.0