DistilBERT US Bank Transaction Classifier v2
A fine-tuned DistilBERT model that classifies US bank transaction descriptions into 17 spending categories. Uses a [debit]/[credit] sign prefix to disambiguate transaction direction: a payroll deposit and a Venmo payment look similar in text but mean opposite things financially.
Successor to v1, which classified on description text alone. v2 adds sign-aware input, expanded merchant coverage (500+), multi-format training across 8 bank statement structures, and PayPal as a first-class format.
How It Works
The model takes a sign prefix + transaction description and outputs one of 17 categories:
Input: "[debit] STARBUCKS #1234 SAN FRANCISCO CA"
Output: Restaurants (0.99)
Input: "[credit] ACME CORP PAYROLL PPD ID: 123456789"
Output: Income (1.00)
Input: "[debit] CHASE CREDIT CRD AUTOPAY PPD ID: 9876543210"
Output: Transfer (1.00)
Input: "[debit] PreApproved Payment Bill User Payment: Netflix"
Output: Subscription (1.00)
The sign prefix encodes the transaction direction from the cardholder's perspective:
- [debit]: money left the account (purchases, payments out, fees)
- [credit]: money entered the account (income, refunds, payments received)
This is critical for distinguishing Income from Transfer. [credit] VENMO CASHOUT is Income (money arriving). [debit] VENMO PAYMENT TO JOHN SMITH is Transfer (money leaving). The description alone can't tell you which.
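As a minimal sketch of building the model input, the helper below derives the prefix from a signed amount. It assumes negative amounts mean money leaving the account, which is a convention of this example and varies between bank exports. `with_sign_prefix` is a hypothetical name, not part of the model's API.

```python
def with_sign_prefix(description: str, amount: float) -> str:
    """Prepend [debit] or [credit] based on the sign of the amount.

    Assumption: negative amounts are outflows. Flip the comparison if
    your data source reports outflows as positive numbers.
    A zero amount is treated as a credit here, which is an arbitrary choice.
    """
    prefix = "[debit]" if amount < 0 else "[credit]"
    return f"{prefix} {description.strip()}"


print(with_sign_prefix("VENMO PAYMENT TO JOHN SMITH", -25.00))
print(with_sign_prefix("VENMO CASHOUT", 25.00))
```

The resulting strings can be passed directly to the classifier.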
Categories (17)
| Category | What it covers |
|---|---|
| Restaurants | Fast food, sit-down, coffee, delivery, POS systems (TST*, SQ*, CLV*) |
| Groceries | Supermarkets, warehouse clubs, farmers markets, convenience stores |
| Shopping | Retail, online, department stores, pet stores, liquor stores, e-commerce marketplaces |
| Transportation | Gas, EV charging, rideshare, auto service, parking, tolls, DMV |
| Entertainment | Movies, events, gaming, gambling/sportsbooks |
| Utilities | Electric, internet, phone, water, waste/trash, solar |
| Subscription | Streaming, SaaS, AI tools, VPNs, social media premium, dating, business SaaS |
| Healthcare | Pharmacy, doctor, dentist, telehealth, vision, hospital |
| Insurance | Auto, home, health, life, home warranty |
| Mortgage | Bank, credit union, and fintech mortgage payments, escrow, principal |
| Rent | Property management companies, lease payments |
| Travel | Hotels, airlines, car rental, cruise lines, airport services |
| Education | Online courses, tutoring, books, tuition, certification |
| Personal Care | Salon, gym, beauty, spa, barber |
| Transfer | CC autopay, P2P sends, bank transfers, brokerage sweeps, fintech, BNPL, wire, ATM, cashier's checks |
| Income | Payroll, direct deposit, interest, refunds, government benefits, gig economy payouts |
| Fees | Bank fees, late fees, ATM surcharges, service charges |
Account-Type-Implied Categories
If you know the account type, some categories can be assigned without the model:
| Account Type | Category |
|---|---|
| Mortgage | Mortgage |
| Auto Loan | Transportation |
| Student Loan | Education |
| Personal Loan | Transfer |
| HELOC | Transfer |
| CD | Income |
For checking, savings, and credit card accounts, use the model.
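The routing described above might be sketched as follows. The mapping comes from the table; the function name `categorize`, the lowercase account-type keys, and the stand-in classifier are assumptions made for illustration.

```python
from typing import Callable, Optional

# Mapping taken from the account-type table; keys are lowercased for lookup.
ACCOUNT_TYPE_CATEGORY = {
    "mortgage": "Mortgage",
    "auto loan": "Transportation",
    "student loan": "Education",
    "personal loan": "Transfer",
    "heloc": "Transfer",
    "cd": "Income",
}


def categorize(text: str, account_type: Optional[str],
               classify: Callable[[str], str]) -> str:
    """Return the implied category if the account type has one, else ask the model."""
    if account_type:
        implied = ACCOUNT_TYPE_CATEGORY.get(account_type.strip().lower())
        if implied is not None:
            return implied
    return classify(text)


# Stand-in for the real model call so the sketch runs without downloads.
def fake_classify(text: str) -> str:
    return "Restaurants"


print(categorize("[debit] MONTHLY PAYMENT", "Auto Loan", fake_classify))
print(categorize("[debit] STARBUCKS #1234", "checking", fake_classify))
```

In real use, `classify` would wrap the pipeline call and return the top label.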
Training
Model: DistilBERT-base-uncased + LoRA (r=32, alpha=64)
Dataset: 68,000 synthetic samples (4,000 per category)
Trainable: 1.8M / 68.7M parameters (2.6%)
Training: 20 epochs, best at epoch 16
Validation: 99.9% accuracy (15 of 17 categories at 100%)
Multi-Format Training
The model is trained on 8 bank statement formats so it classifies correctly regardless of which bank produced the description:
| Format | Example | Source |
|---|---|---|
| Chase merchant | STARBUCKS #1234 | Chase credit cards |
| Chase ACH | INSTITUTION PURPOSE PPD ID: CODE | Chase checking |
| Apple Card | MERCHANT ADDRESS CITY ZIP STATE USA | Apple Card |
| PayPal native | PreApproved Payment Bill User Payment: MERCHANT | PayPal credit card |
| PayPal prefix | PP*MERCHANT, PYPL*MERCHANT, PAYPAL *MERCHANT | Chase/other banks |
| Capital One | Withdrawal from MERCHANT, Preauthorized Deposit from MERCHANT | Capital One |
| Mercury | MERCHANT; Description or just MERCHANT | Mercury, neobanks |
| POS prefix | SQ *MERCHANT, TST*MERCHANT, CLV*MERCHANT | Square, Toast, Clover |
PayPal formats appear across all spending categories at meaningful rates, reflecting that people use PayPal cards at any merchant.
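One way to spot-check format robustness on your own merchants is to render the same name in several of the layouts above and compare the predictions. The sketch below only builds the input strings. The templates paraphrase the Example column, and `FORMAT_TEMPLATES` and `render_formats` are hypothetical names; this is not the project's generator.

```python
from typing import Dict

# Templates paraphrase the Example column of the format table.
# This is an illustrative subset, not the project's actual generator.
FORMAT_TEMPLATES: Dict[str, str] = {
    "paypal_native": "PreApproved Payment Bill User Payment: {merchant}",
    "paypal_prefix": "PP*{merchant}",
    "capital_one": "Withdrawal from {merchant}",
    "pos_square": "SQ *{merchant}",
    "pos_toast": "TST*{merchant}",
}


def render_formats(merchant: str, sign: str = "debit") -> Dict[str, str]:
    """Render one merchant name in each template, with the sign prefix attached."""
    return {
        name: f"[{sign}] {template.format(merchant=merchant)}"
        for name, template in FORMAT_TEMPLATES.items()
    }


for name, text in render_formats("SAFEWAY").items():
    print(f"{name:14} {text}")
```

Feeding each rendered string to the classifier and checking that the label stays the same gives a quick consistency test for a given merchant.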
Honest Assessment
The 99.9% validation accuracy is on synthetic data. On ~2,000 real transactions:
- 96.1% of model classifications at 0.90+ confidence
- < 0.5% below 0.50 confidence
- 17 bank-category fallbacks (obscure merchants where the model defers)
- Shopping is the weakest category due to overlap with Subscription and Groceries
- Niche/unknown merchants may classify with lower confidence; use merchant rules for known edge cases
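A minimal sketch of how merchant rules, a confidence cut-off, and a bank-category fallback could be layered around the model. It assumes that "bank category" means a category supplied by the bank feed, that 0.90 is a reasonable cut-off (borrowed from the figure above, not a recommendation from the authors), and that `classify` returns results in the same shape as the Hugging Face pipeline. The rule table and function name are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical merchant rules: uppercase substring -> category.
MERCHANT_RULES: Dict[str, str] = {"COSTCO GAS": "Transportation"}

# Assumed cut-off, borrowed from the 0.90 confidence figure quoted above.
MIN_CONFIDENCE = 0.90


def classify_with_fallback(
    text: str,
    classify: Callable[[str], List[dict]],
    bank_category: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (category, source): rules first, then the model, then the bank's label."""
    upper = text.upper()
    for needle, category in MERCHANT_RULES.items():
        if needle in upper:
            return category, "rule"

    top = classify(text)[0]  # pipeline-style result: [{'label': ..., 'score': ...}]
    if top["score"] >= MIN_CONFIDENCE:
        return top["label"], "model"
    if bank_category:
        return bank_category, "bank"
    return top["label"], "model-low-confidence"


# Stand-in classifier so the sketch runs without downloading the model.
def stub(text: str) -> List[dict]:
    return [{"label": "Shopping", "score": 0.62}]


print(classify_with_fallback("[debit] COSTCO GAS #0042", stub))
print(classify_with_fallback("[debit] OBSCURE MERCHANT LLC", stub, bank_category="Groceries"))
```

Returning the source alongside the category makes it easy to audit how often each layer decides.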
Usage
Python
from transformers import pipeline
classifier = pipeline("text-classification", model="DoDataThings/distilbert-us-transaction-classifier-v2")
# Sign prefix required
result = classifier("[debit] STARBUCKS #1234 SAN FRANCISCO CA")
print(result) # [{'label': 'Restaurants', 'score': 0.99}]
# Sign matters for ambiguous transactions
classifier("[credit] VENMO CASHOUT PPD ID: 12345678")
# [{'label': 'Income', 'score': 0.95}]
classifier("[debit] VENMO PAYMENT TO JOHN SMITH")
# [{'label': 'Transfer', 'score': 0.97}]
# Works across all bank formats
classifier("[debit] PreApproved Payment Bill User Payment: Netflix")
# [{'label': 'Subscription', 'score': 1.00}]
classifier("[debit] PP*SAFEWAY")
# [{'label': 'Groceries', 'score': 1.00}]
JavaScript (Transformers.js)
// ES module import: the top-level await below requires ESM (.mjs or "type": "module")
import { pipeline } from '@xenova/transformers';
const classifier = await pipeline(
'text-classification',
'DoDataThings/distilbert-us-transaction-classifier-v2'
);
const result = await classifier('[debit] STARBUCKS #1234');
// [{ label: 'Restaurants', score: 0.99 }]
An ONNX export is included in the onnx/ subdirectory.
Design Decisions
- Sign prefix, not account type. We considered passing account type (checking, credit, etc.) as a feature but concluded that sign alone provides the disambiguation signal. Account type is an upstream routing concern: it determines which classifier runs, not what the classifier outputs.
- 17 model categories + 6 account-type categories. Mortgage is both a model category (for classifying mortgage descriptions on checking accounts) and an account-type-implied category (for mortgage account transactions). This serves both use cases: people with account type metadata and people with just transaction descriptions.
- PayPal as a bank format, not a wrapper. PayPal is a card issuer. People use PayPal cards at restaurants, grocery stores, and everywhere else. The training data treats PayPal formats as first-class bank statement structures across all categories.
- Synthetic data with real formats. The training data is synthetic but models real bank statement patterns: Chase ACH padding, Apple Card address formats, Capital One action prefixes, Mercury's minimal format. The generator is open source so you can extend it.
Training Data
The dataset is published at DoDataThings/us-bank-transaction-categories-v2.
Generator
The synthetic data generator is open source:
node scripts/generate-training-data.js --count 4000 # 4,000 per category
Available at github.com/wnstnb/foliome.
Limitations
- US bank formats only: Trained on Chase, Apple Card, PayPal, Capital One, Mercury, and US Bank patterns
- Synthetic training data: May miss patterns from banks not represented
- Shopping is the weakest category due to overlap with Subscription and Groceries
- Sign prefix required: Passing raw descriptions without [debit]/[credit] will degrade accuracy
- Not a standalone solution: Best results come from combining with merchant rules and account-type classification
License
Apache 2.0