---
license: apache-2.0
tags:
- text-classification
- transformers
- onnx
- safetensors
- transformers.js
- distilbert
- finance
- transactions
- english
language:
- en
datasets:
- DoDataThings/us-bank-transaction-categories-v2
pipeline_tag: text-classification
---

# DistilBERT US Bank Transaction Classifier v2

A fine-tuned DistilBERT model that classifies US bank transaction descriptions into 17 spending categories. Uses a `[debit]`/`[credit]` sign prefix to disambiguate transaction direction: a payroll deposit and a Venmo payment look similar in text but mean opposite things financially.

**Successor to [v1](https://huggingface.co/DoDataThings/distilbert-us-transaction-classifier)**, which classified on description text alone. v2 adds sign-aware input, expanded merchant coverage (500+), multi-format training across 8 bank statement structures, and PayPal as a first-class format.

## How It Works

The model takes a sign prefix + transaction description and outputs one of 17 categories:

```
Input:  "[debit] STARBUCKS #1234 SAN FRANCISCO CA"
Output: Restaurants (0.99)

Input:  "[credit] ACME CORP PAYROLL PPD ID: 123456789"
Output: Income (1.00)

Input:  "[debit] CHASE CREDIT CRD AUTOPAY PPD ID: 9876543210"
Output: Transfer (1.00)

Input:  "[debit] PreApproved Payment Bill User Payment: Netflix"
Output: Subscription (1.00)
```

The sign prefix encodes the transaction direction from the cardholder's perspective:

- `[debit]`: money left the account (purchases, payments out, fees)
- `[credit]`: money entered the account (income, refunds, payments received)

This is critical for distinguishing Income from Transfer. `[credit] VENMO CASHOUT` is Income (money arriving). `[debit] VENMO PAYMENT TO JOHN SMITH` is Transfer (money leaving). The description alone can't tell you which.
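
The sign prefix can usually be derived from the transaction amount before the model is called. The following is a minimal sketch, assuming amounts are signed so that a negative value means money leaving the account; some bank exports invert this or use a separate debit/credit column. `add_sign_prefix` is a hypothetical helper name, not part of the model or of any library.

```python
def add_sign_prefix(description: str, amount: float) -> str:
    """Build the model input from a raw description and a signed amount.

    Assumption: negative amounts are debits (money out); zero and positive
    amounts are treated as credits (money in). Flip the comparison if your
    export uses the opposite convention.
    """
    prefix = "[debit]" if amount < 0 else "[credit]"
    # Trim surrounding whitespace so the prefix is separated by one space.
    return f"{prefix} {description.strip()}"


print(add_sign_prefix("STARBUCKS #1234 SAN FRANCISCO CA", -5.75))
print(add_sign_prefix("ACME CORP PAYROLL PPD ID: 123456789", 2500.00))
```

The returned string is what gets passed to the classifier in place of the raw description.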
## Categories (17)

| Category | What it covers |
|----------|----------------|
| Restaurants | Fast food, sit-down, coffee, delivery, POS systems (TST*, SQ*, CLV*) |
| Groceries | Supermarkets, warehouse clubs, farmers markets, convenience stores |
| Shopping | Retail, online, department stores, pet stores, liquor stores, e-commerce marketplaces |
| Transportation | Gas, EV charging, rideshare, auto service, parking, tolls, DMV |
| Entertainment | Movies, events, gaming, gambling/sportsbooks |
| Utilities | Electric, internet, phone, water, waste/trash, solar |
| Subscription | Streaming, SaaS, AI tools, VPNs, social media premium, dating, business SaaS |
| Healthcare | Pharmacy, doctor, dentist, telehealth, vision, hospital |
| Insurance | Auto, home, health, life, home warranty |
| Mortgage | Bank, credit union, and fintech mortgage payments, escrow, principal |
| Rent | Property management companies, lease payments |
| Travel | Hotels, airlines, car rental, cruise lines, airport services |
| Education | Online courses, tutoring, books, tuition, certification |
| Personal Care | Salon, gym, beauty, spa, barber |
| Transfer | CC autopay, P2P sends, bank transfers, brokerage sweeps, fintech, BNPL, wire, ATM, cashier's checks |
| Income | Payroll, direct deposit, interest, refunds, government benefits, gig economy payouts |
| Fees | Bank fees, late fees, ATM surcharges, service charges |

### Account-Type-Implied Categories

If you know the account type, some categories can be assigned without the model:

| Account Type | Category |
|---|---|
| Mortgage | Mortgage |
| Auto Loan | Transportation |
| Student Loan | Education |
| Personal Loan | Transfer |
| HELOC | Transfer |
| CD | Income |

For checking, savings, and credit card accounts, use the model.
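
The account-type routing above can be sketched as a lookup that runs before the model. This is an illustrative sketch, not code shipped with the model: `ACCOUNT_TYPE_CATEGORIES`, `categorize`, and `fake_classify` are hypothetical names, and the lowercase account-type keys are an assumption about how your metadata is normalized.

```python
from typing import Callable, Optional

# Mapping copied from the account-type table; keys are lowercased (assumption).
ACCOUNT_TYPE_CATEGORIES = {
    "mortgage": "Mortgage",
    "auto loan": "Transportation",
    "student loan": "Education",
    "personal loan": "Transfer",
    "heloc": "Transfer",
    "cd": "Income",
}


def categorize(text: str, account_type: Optional[str],
               classify: Callable[[str], str]) -> str:
    """Return the implied category if the account type has one, else ask the model."""
    key = (account_type or "").strip().lower()
    if key in ACCOUNT_TYPE_CATEGORIES:
        return ACCOUNT_TYPE_CATEGORIES[key]
    # Checking, savings, credit card, or unknown account types fall through.
    return classify(text)


# Stand-in classifier for demonstration only; replace with a real model call.
def fake_classify(text: str) -> str:
    return "Restaurants"


print(categorize("[debit] MONTHLY PAYMENT", "Mortgage", fake_classify))   # Mortgage
print(categorize("[debit] STARBUCKS #1234", "checking", fake_classify))   # Restaurants
```

With the `transformers` pipeline, `classify` could be something like `lambda t: classifier(t)[0]["label"]`, given that the pipeline returns a list of `{'label', 'score'}` dicts.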
## Training

```
Model:      DistilBERT-base-uncased + LoRA (r=32, alpha=64)
Dataset:    68,000 synthetic samples (4,000 per category)
Trainable:  1.8M / 68.7M parameters (2.6%)
Training:   20 epochs, best at epoch 16
Validation: 99.9% accuracy (15 of 17 categories at 100%)
```

### Multi-Format Training

The model is trained on 8 bank statement formats so it classifies correctly regardless of which bank produced the description:

| Format | Example | Source |
|---|---|---|
| Chase merchant | `STARBUCKS #1234` | Chase credit cards |
| Chase ACH | `INSTITUTION PURPOSE PPD ID: CODE` | Chase checking |
| Apple Card | `MERCHANT ADDRESS CITY ZIP STATE USA` | Apple Card |
| PayPal native | `PreApproved Payment Bill User Payment: MERCHANT` | PayPal credit card |
| PayPal prefix | `PP*MERCHANT`, `PYPL*MERCHANT`, `PAYPAL *MERCHANT` | Chase/other banks |
| Capital One | `Withdrawal from MERCHANT`, `Preauthorized Deposit from MERCHANT` | Capital One |
| Mercury | `MERCHANT; Description` or just `MERCHANT` | Mercury, neobanks |
| POS prefix | `SQ *MERCHANT`, `TST*MERCHANT`, `CLV*MERCHANT` | Square, Toast, Clover |

PayPal formats appear across all spending categories at meaningful rates, reflecting that people use PayPal cards at any merchant.

### Honest Assessment

The 99.9% validation accuracy is on synthetic data.
On ~2,000 real transactions:

- **96.1% of model classifications at 0.90+ confidence**
- **< 0.5% below 0.50 confidence**
- 17 bank-category fallbacks (obscure merchants where the model defers)
- Shopping is the weakest category due to overlap with Subscription and Groceries
- Niche/unknown merchants may classify with lower confidence; use merchant rules for known edge cases

## Usage

### Python

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="DoDataThings/distilbert-us-transaction-classifier-v2",
)

# Sign prefix required
result = classifier("[debit] STARBUCKS #1234 SAN FRANCISCO CA")
print(result)
# [{'label': 'Restaurants', 'score': 0.99}]

# Sign matters for ambiguous transactions
classifier("[credit] VENMO CASHOUT PPD ID: 12345678")
# [{'label': 'Income', 'score': 0.95}]

classifier("[debit] VENMO PAYMENT TO JOHN SMITH")
# [{'label': 'Transfer', 'score': 0.97}]

# Works across all bank formats
classifier("[debit] PreApproved Payment Bill User Payment: Netflix")
# [{'label': 'Subscription', 'score': 1.00}]

classifier("[debit] PP*SAFEWAY")
# [{'label': 'Groceries', 'score': 1.00}]
```

### JavaScript (Transformers.js)

```javascript
// Run as an ES module (e.g. a .mjs file) so that top-level await is available.
import { pipeline } from '@xenova/transformers';

const classifier = await pipeline(
  'text-classification',
  'DoDataThings/distilbert-us-transaction-classifier-v2'
);

const result = await classifier('[debit] STARBUCKS #1234');
// [{ label: 'Restaurants', score: 0.99 }]
```

An ONNX export is included in the `onnx/` subdirectory.

## Design Decisions

- **Sign prefix, not account type.** We considered passing account type (checking, credit, etc.) as a feature but concluded that sign alone provides the disambiguation signal. Account type is an upstream routing concern: it determines which classifier runs, not what the classifier outputs.
- **17 model categories + 6 account-type categories.** Mortgage is both a model category (for classifying mortgage descriptions on checking accounts) and an account-type-implied category (for mortgage account transactions). This serves both use cases: people with account type metadata and people with just transaction descriptions.
- **PayPal as a bank format, not a wrapper.** PayPal is a card issuer. People use PayPal cards at restaurants, grocery stores, and everywhere else. The training data treats PayPal formats as first-class bank statement structures across all categories.
- **Synthetic data with real formats.** The training data is synthetic but models real bank statement patterns: Chase ACH padding, Apple Card address formats, Capital One action prefixes, Mercury's minimal format. The generator is open source so you can extend it.

## Training Data

The dataset is published at [`DoDataThings/us-bank-transaction-categories-v2`](https://huggingface.co/datasets/DoDataThings/us-bank-transaction-categories-v2).

## Generator

The synthetic data generator is open source:

```bash
node scripts/generate-training-data.js --count 4000  # 4,000 per category
```

Available at [github.com/wnstnb/foliome](https://github.com/wnstnb/foliome).

## Limitations

- **US bank formats only**: trained on Chase, Apple Card, PayPal, Capital One, Mercury, and US Bank patterns
- **Synthetic training data**: may miss patterns from banks not represented
- **Shopping is the weakest category** due to overlap with Subscription and Groceries
- **Sign prefix required**: passing raw descriptions without `[debit]`/`[credit]` will degrade accuracy
- **Not a standalone solution**: best results come from combining with merchant rules and account-type classification

## License

Apache 2.0