	---
	license: apache-2.0
	tags:
	- text-classification
	- transformers
	- onnx
	- safetensors
	- transformers.js
	- distilbert
	- finance
	- transactions
	- english
	language:
	- en
	datasets:
	- DoDataThings/us-bank-transaction-categories-v2
	pipeline_tag: text-classification
	---

	# DistilBERT US Bank Transaction Classifier v2

	A fine-tuned DistilBERT model that classifies US bank transaction descriptions into 17 spending categories. It uses a `[debit]`/`[credit]` sign prefix to disambiguate transaction direction: descriptions that look alike in text, such as a Venmo cashout and a Venmo payment, can mean opposite things financially.

	Successor to [v1](https://huggingface.co/DoDataThings/distilbert-us-transaction-classifier), which classified on description text alone. v2 adds sign-aware input, expanded merchant coverage (500+), multi-format training across 8 bank statement structures, and PayPal as a first-class format.

	## How It Works

	The model takes a sign prefix + transaction description and outputs one of 17 categories:

	```
	Input: "[debit] STARBUCKS #1234 SAN FRANCISCO CA"
	Output: Restaurants (0.99)

	Input: "[credit] ACME CORP PAYROLL PPD ID: 123456789"
	Output: Income (1.00)

	Input: "[debit] CHASE CREDIT CRD AUTOPAY PPD ID: 9876543210"
	Output: Transfer (1.00)

	Input: "[debit] PreApproved Payment Bill User Payment: Netflix"
	Output: Subscription (1.00)
	```

	The sign prefix encodes the transaction direction from the account holder's perspective:
	- `[debit]` — money left the account (purchases, payments out, fees)
	- `[credit]` — money entered the account (income, refunds, payments received)

	This is critical for distinguishing Income from Transfer. `[credit] VENMO CASHOUT` is Income (money arriving). `[debit] VENMO PAYMENT TO JOHN SMITH` is Transfer (money leaving). The description alone can't tell you which.
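
In practice the prefix usually has to be derived from a signed amount. The following is a minimal sketch, assuming the bank export reports money leaving the account as a negative number (some exports invert this, so check the convention first). `add_sign_prefix` is a hypothetical helper, not part of the model's tooling:

```python
def add_sign_prefix(description: str, amount: float) -> str:
    """Prepend the [debit]/[credit] tag the model expects.

    Assumption: negative amounts are outflows. Flip the comparison if
    your export reports debits as positive numbers. A zero amount is
    treated as a credit here, which is an arbitrary choice.
    """
    prefix = "[debit]" if amount < 0 else "[credit]"
    # Only trim the ends; internal spacing is left as the bank wrote it.
    return f"{prefix} {description.strip()}"


print(add_sign_prefix("VENMO PAYMENT TO JOHN SMITH", -25.00))
print(add_sign_prefix("VENMO CASHOUT", 140.00))
```

The returned string can be passed to the classifier as-is.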

	## Categories (17)

	| Category | What it covers |
	|----------|----------------|
	| Restaurants | Fast food, sit-down, coffee, delivery, POS systems (TST*, SQ *, CLV*) |
	| Groceries | Supermarkets, warehouse clubs, farmers markets, convenience stores |
	| Shopping | Retail, online, department stores, pet stores, liquor stores, e-commerce marketplaces |
	| Transportation | Gas, EV charging, rideshare, auto service, parking, tolls, DMV |
	| Entertainment | Movies, events, gaming, gambling/sportsbooks |
	| Utilities | Electric, internet, phone, water, waste/trash, solar |
	| Subscription | Streaming, SaaS, AI tools, VPNs, social media premium, dating, business SaaS |
	| Healthcare | Pharmacy, doctor, dentist, telehealth, vision, hospital |
	| Insurance | Auto, home, health, life, home warranty |
	| Mortgage | Bank, credit union, and fintech mortgage payments, escrow, principal |
	| Rent | Property management companies, lease payments |
	| Travel | Hotels, airlines, car rental, cruise lines, airport services |
	| Education | Online courses, tutoring, books, tuition, certification |
	| Personal Care | Salon, gym, beauty, spa, barber |
	| Transfer | CC autopay, P2P sends, bank transfers, brokerage sweeps, fintech, BNPL, wire, ATM, cashier's checks |
	| Income | Payroll, direct deposit, interest, refunds, government benefits, gig economy payouts |
	| Fees | Bank fees, late fees, ATM surcharges, service charges |

	### Account-Type-Implied Categories

	If you know the account type, some categories can be assigned without the model:

	| Account Type | Category |
	|---|---|
	| Mortgage | Mortgage |
	| Auto Loan | Transportation |
	| Student Loan | Education |
	| Personal Loan | Transfer |
	| HELOC | Transfer |
	| CD | Income |

	For checking, savings, and credit card accounts, use the model.
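
The routing described above can be sketched as a lookup that runs before the model. This is an illustration only: the account-type strings are assumed to match the table's labels after lowercasing, and `categorize` and its `classify` callback are hypothetical names standing in for whatever wraps the model in your code.

```python
from typing import Callable, Optional

# Mapping copied from the account-type table; keys are lowercased for lookup.
ACCOUNT_TYPE_CATEGORY = {
    "mortgage": "Mortgage",
    "auto loan": "Transportation",
    "student loan": "Education",
    "personal loan": "Transfer",
    "heloc": "Transfer",
    "cd": "Income",
}


def categorize(text: str, account_type: Optional[str],
               classify: Callable[[str], str]) -> str:
    """Return the implied category if the account type has one, else ask the model."""
    key = (account_type or "").strip().lower()
    if key in ACCOUNT_TYPE_CATEGORY:
        return ACCOUNT_TYPE_CATEGORY[key]
    # Checking, savings, credit card, or unknown account types go to the model.
    return classify(text)


# Stub classifier so the sketch runs without downloading the model.
stub = lambda text: "Restaurants"
print(categorize("[debit] PAYMENT THANK YOU", "Auto Loan", stub))
print(categorize("[debit] STARBUCKS #1234", "checking", stub))
```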

	## Training

	```
	Model: DistilBERT-base-uncased + LoRA (r=32, alpha=64)
	Dataset: 68,000 synthetic samples (4,000 per category)
	Trainable: 1.8M / 68.7M parameters (2.6%)
	Training: 20 epochs, best at epoch 16
	Validation: 99.9% accuracy (15 of 17 categories at 100%)
	```
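
The parameter counts can be sanity-checked with plain arithmetic. The card does not say which modules LoRA targets or how the classification head is stored, so the sketch below is one reading that happens to reproduce the reported figures, not a confirmed configuration. It assumes LoRA adapters on all four attention projections in each layer, a fully trainable classification head, and a total that counts the head twice (a frozen original plus a trainable copy). The dimensions are the standard DistilBERT-base ones.

```python
# Standard DistilBERT-base dimensions.
HIDDEN, LAYERS, FFN, VOCAB, MAX_POS = 768, 6, 3072, 30522, 512
NUM_LABELS = 17   # from the card
RANK = 32         # LoRA r, from the card


def linear(n_in: int, n_out: int) -> int:
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN
embeddings = VOCAB * HIDDEN + MAX_POS * HIDDEN + layer_norm
per_layer = (4 * linear(HIDDEN, HIDDEN)                    # q, k, v, out projections
             + linear(HIDDEN, FFN) + linear(FFN, HIDDEN)   # feed-forward
             + 2 * layer_norm)
encoder = embeddings + LAYERS * per_layer

# Classification head: pre-classifier layer plus 17-way output layer.
head = linear(HIDDEN, HIDDEN) + linear(HIDDEN, NUM_LABELS)

# Assumption: one LoRA pair (A: HIDDEN x RANK, B: RANK x HIDDEN) on each of
# the four attention projections in every layer.
lora = LAYERS * 4 * (2 * HIDDEN * RANK)

trainable = lora + head
total = encoder + 2 * head + lora  # assumption: head counted twice

print(f"{trainable:,} / {total:,} = {trainable / total:.1%}")
# 1,783,313 / 68,749,858 = 2.6%
```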

	### Multi-Format Training

	The model is trained on 8 bank statement formats so that classification does not depend on which of these formats the description arrives in:

	| Format | Example | Source |
	|---|---|---|
	| Chase merchant | `STARBUCKS #1234` | Chase credit cards |
	| Chase ACH | `INSTITUTION PURPOSE PPD ID: CODE` | Chase checking |
	| Apple Card | `MERCHANT ADDRESS CITY ZIP STATE USA` | Apple Card |
	| PayPal native | `PreApproved Payment Bill User Payment: MERCHANT` | PayPal credit card |
	| PayPal prefix | `PP*MERCHANT`, `PYPL*MERCHANT`, `PAYPAL *MERCHANT` | Chase/other banks |
	| Capital One | `Withdrawal from MERCHANT`, `Preauthorized Deposit from MERCHANT` | Capital One |
	| Mercury | `MERCHANT; Description` or just `MERCHANT` | Mercury, neobanks |
	| POS prefix | `SQ *MERCHANT`, `TST*MERCHANT`, `CLV*MERCHANT` | Square, Toast, Clover |

	PayPal formats appear across all spending categories at meaningful rates, reflecting that people use PayPal cards at any merchant.
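
One way to probe this on your own merchants is to render a single merchant name into several of the formats and check that the predicted label stays the same. The sketch below is not the project's generator. The templates are transcribed from a subset of the format table, and `FORMAT_TEMPLATES` and `render_variants` are hypothetical names.

```python
from typing import List

# Templates transcribed from a subset of the format table.
FORMAT_TEMPLATES = {
    "paypal_native": "PreApproved Payment Bill User Payment: {merchant}",
    "paypal_prefix": "PAYPAL *{merchant}",
    "capital_one_debit": "Withdrawal from {merchant}",
    "mercury": "{merchant}",
    "clover": "CLV*{merchant}",
}


def render_variants(merchant: str, sign: str = "debit") -> List[str]:
    """Return one sign-prefixed description per template, in dict order."""
    return [
        f"[{sign}] {template.format(merchant=merchant)}"
        for template in FORMAT_TEMPLATES.values()
    ]


for line in render_variants("NETFLIX"):
    print(line)
```

Feeding each variant to the classifier and comparing labels gives a quick robustness check for a merchant you care about.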

	### Honest Assessment

	The 99.9% validation accuracy is on synthetic data. On ~2,000 real transactions:

	- 96.1% of model classifications at 0.90+ confidence
	- < 0.5% below 0.50 confidence
	- 17 bank-category fallbacks (obscure merchants where the model defers)
	- Shopping is the weakest category due to overlap with Subscription and Groceries
	- Niche/unknown merchants may classify with lower confidence — use merchant rules for known edge cases
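
A possible way to combine these signals is sketched below. The precedence (merchant rules, then a confident model prediction, then the bank's own category) and the 0.90 cutoff are assumptions chosen for illustration; the card does not state the threshold used for the fallbacks. `resolve_category` is a hypothetical helper, and `prediction` is one element of the pipeline's output list.

```python
from typing import Dict, Optional, Tuple


def resolve_category(prediction: Dict[str, object], description: str,
                     merchant_rules: Dict[str, str],
                     bank_category: Optional[str] = None,
                     threshold: float = 0.90) -> Tuple[str, str]:
    """Return (category, source) for one transaction."""
    # 1. Explicit merchant rules win (case-insensitive substring match).
    upper = description.upper()
    for needle, category in merchant_rules.items():
        if needle.upper() in upper:
            return category, "rule"
    # 2. Accept the model's label when it is confident enough.
    if float(prediction["score"]) >= threshold:
        return str(prediction["label"]), "model"
    # 3. Otherwise defer to the category supplied by the bank, if any.
    if bank_category:
        return bank_category, "bank"
    # 4. Nothing better available: keep the low-confidence model label.
    return str(prediction["label"]), "model-low-confidence"


rules = {"COSTCO GAS": "Transportation"}
print(resolve_category({"label": "Groceries", "score": 0.97},
                       "[debit] COSTCO GAS #0123", rules))
print(resolve_category({"label": "Shopping", "score": 0.42},
                       "[debit] OBSCURE MERCHANT LLC", rules, "Groceries"))
```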

	## Usage

	### Python

	```python
	from transformers import pipeline

	classifier = pipeline("text-classification", model="DoDataThings/distilbert-us-transaction-classifier-v2")

	# Sign prefix required
	result = classifier("[debit] STARBUCKS #1234 SAN FRANCISCO CA")
	print(result) # [{'label': 'Restaurants', 'score': 0.99}]

	# Sign matters for ambiguous transactions
	classifier("[credit] VENMO CASHOUT PPD ID: 12345678")
	# [{'label': 'Income', 'score': 0.95}]

	classifier("[debit] VENMO PAYMENT TO JOHN SMITH")
	# [{'label': 'Transfer', 'score': 0.97}]

	# Works across all bank formats
	classifier("[debit] PreApproved Payment Bill User Payment: Netflix")
	# [{'label': 'Subscription', 'score': 1.00}]

	classifier("[debit] PP*SAFEWAY")
	# [{'label': 'Groceries', 'score': 1.00}]
	```

	### JavaScript (Transformers.js)

	```javascript
	import { pipeline } from '@xenova/transformers';

	// Top-level await needs an ES module (.mjs file or "type": "module")
	const classifier = await pipeline(
	  'text-classification',
	  'DoDataThings/distilbert-us-transaction-classifier-v2'
	);

	// Sign prefix required, same as in Python
	const result = await classifier('[debit] STARBUCKS #1234');
	// [{ label: 'Restaurants', score: 0.99 }]
	```

	An ONNX export is included in the `onnx/` subdirectory.

	## Design Decisions

	- Sign prefix, not account type. We considered passing account type (checking, credit, etc.) as a feature but concluded that sign alone provides the disambiguation signal. Account type is an upstream routing concern: it determines whether the classifier runs at all, not what the classifier outputs.
	- 17 model categories + 6 account-type categories. Mortgage is both a model category (for classifying mortgage descriptions on checking accounts) and an account-type-implied category (for mortgage account transactions). This serves both use cases — people with account type metadata and people with just transaction descriptions.
	- PayPal as a bank format, not a wrapper. PayPal is a card issuer. People use PayPal cards at restaurants, grocery stores, and everywhere else. The training data treats PayPal formats as first-class bank statement structures across all categories.
	- Synthetic data with real formats. The training data is synthetic but models real bank statement patterns — Chase ACH padding, Apple Card address formats, Capital One action prefixes, Mercury's minimal format. The generator is open source so you can extend it.

	## Training Data

	The dataset is published at [`DoDataThings/us-bank-transaction-categories-v2`](https://huggingface.co/datasets/DoDataThings/us-bank-transaction-categories-v2).

	## Generator

	The synthetic data generator is open source:

	```bash
	node scripts/generate-training-data.js --count 4000 # 4,000 per category
	```

	Available at [github.com/wnstnb/foliome](https://github.com/wnstnb/foliome).

	## Limitations

	- US bank formats only — Trained on Chase, Apple Card, PayPal, Capital One, Mercury, and US Bank patterns
	- Synthetic training data — May miss patterns from banks not represented
	- Shopping is the weakest category due to overlap with Subscription and Groceries
	- Sign prefix required — Passing raw descriptions without `[debit]`/`[credit]` will degrade accuracy
	- Not a standalone solution — Best results come from combining with merchant rules and account-type classification

	## License

	Apache 2.0