---
license: apache-2.0
tags:
- text-classification
- transformers
- onnx
- safetensors
- transformers.js
- distilbert
- finance
- transactions
- english
language:
- en
datasets:
- DoDataThings/us-bank-transaction-categories-v2
pipeline_tag: text-classification
---

# DistilBERT US Bank Transaction Classifier v2

A fine-tuned DistilBERT model that classifies US bank transaction descriptions into 17 spending categories. Each input carries a `[debit]`/`[credit]` sign prefix that disambiguates transaction direction: a payroll deposit and a Venmo payment can look similar in text but mean opposite things financially.

**Successor to [v1](https://huggingface.co/DoDataThings/distilbert-us-transaction-classifier)**, which classified on description text alone. v2 adds sign-aware input, expanded merchant coverage (500+ merchants), multi-format training across 8 bank statement structures, and PayPal as a first-class format.

## How It Works

The model takes a sign prefix plus a transaction description and outputs one of 17 categories:

```
Input:  "[debit] STARBUCKS #1234 SAN FRANCISCO CA"
Output: Restaurants (0.99)

Input:  "[credit] ACME CORP PAYROLL PPD ID: 123456789"
Output: Income (1.00)

Input:  "[debit] CHASE CREDIT CRD AUTOPAY PPD ID: 9876543210"
Output: Transfer (1.00)

Input:  "[debit] PreApproved Payment Bill User Payment: Netflix"
Output: Subscription (1.00)
```

The sign prefix encodes the transaction direction from the account holder's perspective:
- `[debit]`: money left the account (purchases, payments out, fees)
- `[credit]`: money entered the account (income, refunds, payments received)

This is critical for distinguishing Income from Transfer. `[credit] VENMO CASHOUT` is Income (money arriving). `[debit] VENMO PAYMENT TO JOHN SMITH` is Transfer (money leaving). The description alone can't tell you which.

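Most bank exports carry a signed amount rather than a ready-made prefix, so the input string has to be assembled before calling the model. The sketch below shows one way to do that. The function name `build_model_input` is hypothetical, and the sign convention (negative means money leaving the account) is an assumption that you should check against your own export.

```python
def build_model_input(description: str, amount: float) -> str:
    """Prefix a raw description with the sign token the model expects.

    ASSUMPTION: negative amounts are money leaving the account. Flip the
    comparison if your export uses the opposite convention. A zero amount
    is treated as a credit here, which is an arbitrary choice.
    """
    prefix = "[debit]" if amount < 0 else "[credit]"
    # Only trim the ends; internal spacing is left as the bank wrote it.
    return f"{prefix} {description.strip()}"


if __name__ == "__main__":
    print(build_model_input("STARBUCKS #1234 SAN FRANCISCO CA", -5.25))
    print(build_model_input("ACME CORP PAYROLL PPD ID: 123456789", 2500.00))
```

Some credit card exports report purchases as positive numbers, in which case the comparison needs to be inverted for those accounts.
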
## Categories (17)

| Category | What it covers |
|----------|----------------|
| Restaurants | Fast food, sit-down, coffee, delivery, POS systems (TST*, SQ*, CLV*) |
| Groceries | Supermarkets, warehouse clubs, farmers markets, convenience stores |
| Shopping | Retail, online, department stores, pet stores, liquor stores, e-commerce marketplaces |
| Transportation | Gas, EV charging, rideshare, auto service, parking, tolls, DMV |
| Entertainment | Movies, events, gaming, gambling/sportsbooks |
| Utilities | Electric, internet, phone, water, waste/trash, solar |
| Subscription | Streaming, SaaS, AI tools, VPNs, social media premium, dating, business SaaS |
| Healthcare | Pharmacy, doctor, dentist, telehealth, vision, hospital |
| Insurance | Auto, home, health, life, home warranty |
| Mortgage | Bank, credit union, and fintech mortgage payments, escrow, principal |
| Rent | Property management companies, lease payments |
| Travel | Hotels, airlines, car rental, cruise lines, airport services |
| Education | Online courses, tutoring, books, tuition, certification |
| Personal Care | Salon, gym, beauty, spa, barber |
| Transfer | CC autopay, P2P sends, bank transfers, brokerage sweeps, fintech, BNPL, wire, ATM, cashier's checks |
| Income | Payroll, direct deposit, interest, refunds, government benefits, gig economy payouts |
| Fees | Bank fees, late fees, ATM surcharges, service charges |

### Account-Type-Implied Categories

If you know the account type, some categories can be assigned without the model:

| Account Type | Category |
|---|---|
| Mortgage | Mortgage |
| Auto Loan | Transportation |
| Student Loan | Education |
| Personal Loan | Transfer |
| HELOC | Transfer |
| CD | Income |

For checking, savings, and credit card accounts, use the model.

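The routing described above can be sketched as a lookup that runs before the model. The mapping is copied from the table; the function name `categorize`, the lowercase keys, and the `classify` callable (standing in for a call to the model) are assumptions made for illustration.

```python
from typing import Callable, Optional

# Mapping copied from the account-type table; keys are lowercased for matching.
ACCOUNT_TYPE_CATEGORY = {
    "mortgage": "Mortgage",
    "auto loan": "Transportation",
    "student loan": "Education",
    "personal loan": "Transfer",
    "heloc": "Transfer",
    "cd": "Income",
}


def categorize(
    text: str,
    account_type: Optional[str],
    classify: Callable[[str], str],
) -> str:
    """Return the implied category if the account type has one, else ask the model.

    `classify` is a hypothetical stand-in for the model call: it takes the
    sign-prefixed description and returns a category label.
    """
    key = (account_type or "").strip().lower()
    if key in ACCOUNT_TYPE_CATEGORY:
        return ACCOUNT_TYPE_CATEGORY[key]
    # Checking, savings, credit card, or unknown account types go to the model.
    return classify(text)
```

With the Hugging Face pipeline, `classify` could be as small as `lambda t: classifier(t)[0]["label"]`.
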
## Training

```
Model:       DistilBERT-base-uncased + LoRA (r=32, alpha=64)
Dataset:     68,000 synthetic samples (4,000 per category)
Trainable:   1.8M / 68.7M parameters (2.6%)
Training:    20 epochs, best at epoch 16
Validation:  99.9% accuracy (15 of 17 categories at 100%)
```

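The trainable-parameter figure can be sanity-checked with a little arithmetic. The card does not say which modules LoRA was applied to, so the sketch below makes two assumptions: adapters on all four attention projections in each layer, and a fully trained classification head. The hidden size (768), layer count (6), and head layout come from the public DistilBERT-base architecture.

```python
# DistilBERT-base dimensions (public model architecture).
HIDDEN = 768
NUM_LAYERS = 6
NUM_LABELS = 17

# LoRA rank from the training summary above.
RANK = 32

# ASSUMPTION: LoRA adapts all four attention projections in every layer
# (q_lin, k_lin, v_lin, out_lin). The model card does not state this.
ADAPTED_PER_LAYER = 4

# Each adapted 768x768 linear layer gains two low-rank matrices:
# A with shape (rank, in_features) and B with shape (out_features, rank).
lora_per_module = RANK * (HIDDEN + HIDDEN)
lora_total = lora_per_module * ADAPTED_PER_LAYER * NUM_LAYERS

# ASSUMPTION: the sequence-classification head is trained in full.
pre_classifier = HIDDEN * HIDDEN + HIDDEN          # Linear(768, 768) weights + bias
classifier = HIDDEN * NUM_LABELS + NUM_LABELS      # Linear(768, 17) weights + bias
head_total = pre_classifier + classifier

trainable = lora_total + head_total
print(f"LoRA: {lora_total:,}  head: {head_total:,}  trainable: {trainable:,}")
```

Under these assumptions the total is 1,783,313 parameters, which rounds to the reported 1.8M and is about 2.6% of 68.7M. A different choice of target modules would give a different count, so treat this as a plausibility check rather than a statement of the actual training setup.
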
### Multi-Format Training

The model is trained on 8 bank statement formats so it classifies correctly regardless of which bank produced the description:

| Format | Example | Source |
|---|---|---|
| Chase merchant | `STARBUCKS #1234` | Chase credit cards |
| Chase ACH | `INSTITUTION PURPOSE PPD ID: CODE` | Chase checking |
| Apple Card | `MERCHANT ADDRESS CITY ZIP STATE USA` | Apple Card |
| PayPal native | `PreApproved Payment Bill User Payment: MERCHANT` | PayPal credit card |
| PayPal prefix | `PP*MERCHANT`, `PYPL*MERCHANT`, `PAYPAL *MERCHANT` | Chase/other banks |
| Capital One | `Withdrawal from MERCHANT`, `Preauthorized Deposit from MERCHANT` | Capital One |
| Mercury | `MERCHANT; Description` or just `MERCHANT` | Mercury, neobanks |
| POS prefix | `SQ *MERCHANT`, `TST*MERCHANT`, `CLV*MERCHANT` | Square, Toast, Clover |

PayPal formats appear across all spending categories at meaningful rates, reflecting that people use PayPal cards at any merchant.

### Honest Assessment

The 99.9% validation accuracy is measured on synthetic data. On ~2,000 real transactions:

- **96.1% of model classifications at 0.90+ confidence**
- **< 0.5% below 0.50 confidence**
- 17 bank-category fallbacks (obscure merchants where the model defers)
- Shopping is the weakest category due to overlap with Subscription and Groceries
- Niche/unknown merchants may classify with lower confidence; use merchant rules for known edge cases

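The confidence bands above suggest a simple deferral policy: accept the model's label when the score is high, and otherwise fall back to another source. The sketch below is one possible version. The function name `resolve_category`, the `bank_category` argument (a category supplied by the bank's own export, if any), and the 0.90 default threshold are all assumptions, not settings taken from the original pipeline.

```python
from typing import Optional, Tuple


def resolve_category(
    prediction: dict,
    bank_category: Optional[str] = None,
    threshold: float = 0.90,
) -> Tuple[str, str]:
    """Pick a final category and report where it came from.

    `prediction` is one item from the text-classification pipeline output,
    e.g. {"label": "Restaurants", "score": 0.99}.
    """
    if prediction["score"] >= threshold:
        return prediction["label"], "model"
    if bank_category:
        # Hypothetical fallback: trust the bank's own category when the model is unsure.
        return bank_category, "bank"
    # Nothing better is available: keep the model label but flag it for review.
    return prediction["label"], "model-low-confidence"
```

Returning the source alongside the label makes it easy to count how often each path is taken on your own data.
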
## Usage

### Python

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="DoDataThings/distilbert-us-transaction-classifier-v2")

# Sign prefix required
result = classifier("[debit] STARBUCKS #1234 SAN FRANCISCO CA")
print(result)  # [{'label': 'Restaurants', 'score': 0.99}]

# Sign matters for ambiguous transactions
classifier("[credit] VENMO CASHOUT PPD ID: 12345678")
# [{'label': 'Income', 'score': 0.95}]

classifier("[debit] VENMO PAYMENT TO JOHN SMITH")
# [{'label': 'Transfer', 'score': 0.97}]

# Works across all bank formats
classifier("[debit] PreApproved Payment Bill User Payment: Netflix")
# [{'label': 'Subscription', 'score': 1.00}]

classifier("[debit] PP*SAFEWAY")
# [{'label': 'Groceries', 'score': 1.00}]
```

### JavaScript (Transformers.js)

```javascript
// Transformers.js ships as an ES module, and top-level await also needs ESM,
// so run this from a .mjs file or a package with "type": "module".
import { pipeline } from '@xenova/transformers';

const classifier = await pipeline(
  'text-classification',
  'DoDataThings/distilbert-us-transaction-classifier-v2'
);

const result = await classifier('[debit] STARBUCKS #1234');
// [{ label: 'Restaurants', score: 0.99 }]
```

An ONNX export is included in the `onnx/` subdirectory.

## Design Decisions

- **Sign prefix, not account type.** We considered passing account type (checking, credit, etc.) as a feature but concluded that sign alone provides the disambiguation signal. Account type is an upstream routing concern: it determines which classifier runs, not what the classifier outputs.
- **17 model categories + 6 account-type categories.** Mortgage is both a model category (for classifying mortgage descriptions on checking accounts) and an account-type-implied category (for mortgage account transactions). This serves both use cases: people with account type metadata and people with only transaction descriptions.
- **PayPal as a bank format, not a wrapper.** PayPal is a card issuer. People use PayPal cards at restaurants, grocery stores, and everywhere else. The training data treats PayPal formats as first-class bank statement structures across all categories.
- **Synthetic data with real formats.** The training data is synthetic but models real bank statement patterns, including Chase ACH padding, Apple Card address formats, Capital One action prefixes, and Mercury's minimal format. The generator is open source so you can extend it.

## Training Data

The dataset is published at [`DoDataThings/us-bank-transaction-categories-v2`](https://huggingface.co/datasets/DoDataThings/us-bank-transaction-categories-v2).

## Generator

The synthetic data generator is open source:

```bash
node scripts/generate-training-data.js --count 4000  # 4,000 per category
```

Available at [github.com/wnstnb/foliome](https://github.com/wnstnb/foliome).

## Limitations

- **US bank formats only**: Trained on Chase, Apple Card, PayPal, Capital One, Mercury, and US Bank patterns
- **Synthetic training data**: May miss patterns from banks not represented
- **Shopping is the weakest category** due to overlap with Subscription and Groceries
- **Sign prefix required**: Passing raw descriptions without `[debit]`/`[credit]` will degrade accuracy
- **Not a standalone solution**: Best results come from combining with merchant rules and account-type classification

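As a rough illustration of the merchant rules mentioned in the last point, the sketch below checks a short list of regular expressions before the model is consulted. The two rules and the name `apply_merchant_rules` are hypothetical examples, not rules shipped with this model; a real list would come from edge cases you have seen in your own data.

```python
import re
from typing import Optional

# Hypothetical rules for illustration only; replace with your own known edge cases.
MERCHANT_RULES = [
    (re.compile(r"\bCOSTCO GAS\b", re.IGNORECASE), "Transportation"),
    (re.compile(r"\bAMAZON PRIME\b", re.IGNORECASE), "Subscription"),
]


def apply_merchant_rules(description: str) -> Optional[str]:
    """Return a category if any rule matches, or None to defer to the model."""
    for pattern, category in MERCHANT_RULES:
        if pattern.search(description):
            return category
    return None
```

A caller would try `apply_merchant_rules` first and run the classifier only when it returns `None`.
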
## License

Apache 2.0