---
license: apache-2.0
tags:
- text-classification
- transformers
- onnx
- safetensors
- transformers.js
- distilbert
- finance
- transactions
- english
language:
- en
datasets:
- DoDataThings/us-bank-transaction-categories-v2
pipeline_tag: text-classification
---
# DistilBERT US Bank Transaction Classifier v2
A fine-tuned DistilBERT model that classifies US bank transaction descriptions into 17 spending categories. Uses a `[debit]`/`[credit]` sign prefix to disambiguate transaction direction: a payroll deposit and a Venmo payment look similar in text but mean opposite things financially.
**Successor to [v1](https://huggingface.co/DoDataThings/distilbert-us-transaction-classifier)**, which classified on description text alone. v2 adds sign-aware input, expanded merchant coverage (500+), multi-format training across 8 bank statement structures, and PayPal as a first-class format.
## How It Works
The model takes a sign prefix + transaction description and outputs one of 17 categories:
```
Input: "[debit] STARBUCKS #1234 SAN FRANCISCO CA"
Output: Restaurants (0.99)
Input: "[credit] ACME CORP PAYROLL PPD ID: 123456789"
Output: Income (1.00)
Input: "[debit] CHASE CREDIT CRD AUTOPAY PPD ID: 9876543210"
Output: Transfer (1.00)
Input: "[debit] PreApproved Payment Bill User Payment: Netflix"
Output: Subscription (1.00)
```
The sign prefix encodes the transaction direction from the cardholder's perspective:
- `[debit]`: money left the account (purchases, payments out, fees)
- `[credit]`: money entered the account (income, refunds, payments received)
This is critical for distinguishing Income from Transfer. `[credit] VENMO CASHOUT` is Income (money arriving). `[debit] VENMO PAYMENT TO JOHN SMITH` is Transfer (money leaving). The description alone can't tell you which.
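Most bank exports carry a signed amount rather than a ready-made prefix. The sketch below shows one way to build the model input from such a row. `build_model_input` is a hypothetical helper, not part of the model or of `transformers`, and it assumes that a negative amount means money left the account. Some exports (credit card statements in particular) invert that convention, so check yours first.

```python
def build_model_input(description: str, amount: float) -> str:
    """Prepend the sign prefix the model expects.

    Assumption: amount < 0 means money left the account (debit) and
    amount >= 0 means money entered it (credit). Flip the comparison
    if your export uses the opposite convention.
    """
    prefix = "[debit]" if amount < 0 else "[credit]"
    # Only trim the ends; interior spacing is left as the bank wrote it.
    return f"{prefix} {description.strip()}"


print(build_model_input("STARBUCKS #1234 SAN FRANCISCO CA", -5.75))
# [debit] STARBUCKS #1234 SAN FRANCISCO CA
print(build_model_input("ACME CORP PAYROLL PPD ID: 123456789", 2500.00))
# [credit] ACME CORP PAYROLL PPD ID: 123456789
```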
## Categories (17)
| Category | What it covers |
|----------|----------------|
| Restaurants | Fast food, sit-down, coffee, delivery, POS systems (TST*, SQ*, CLV*) |
| Groceries | Supermarkets, warehouse clubs, farmers markets, convenience stores |
| Shopping | Retail, online, department stores, pet stores, liquor stores, e-commerce marketplaces |
| Transportation | Gas, EV charging, rideshare, auto service, parking, tolls, DMV |
| Entertainment | Movies, events, gaming, gambling/sportsbooks |
| Utilities | Electric, internet, phone, water, waste/trash, solar |
| Subscription | Streaming, SaaS, AI tools, VPNs, social media premium, dating, business SaaS |
| Healthcare | Pharmacy, doctor, dentist, telehealth, vision, hospital |
| Insurance | Auto, home, health, life, home warranty |
| Mortgage | Bank, credit union, and fintech mortgage payments, escrow, principal |
| Rent | Property management companies, lease payments |
| Travel | Hotels, airlines, car rental, cruise lines, airport services |
| Education | Online courses, tutoring, books, tuition, certification |
| Personal Care | Salon, gym, beauty, spa, barber |
| Transfer | CC autopay, P2P sends, bank transfers, brokerage sweeps, fintech, BNPL, wire, ATM, cashier's checks |
| Income | Payroll, direct deposit, interest, refunds, government benefits, gig economy payouts |
| Fees | Bank fees, late fees, ATM surcharges, service charges |
### Account-Type-Implied Categories
If you know the account type, some categories can be assigned without the model:
| Account Type | Category |
|---|---|
| Mortgage | Mortgage |
| Auto Loan | Transportation |
| Student Loan | Education |
| Personal Loan | Transfer |
| HELOC | Transfer |
| CD | Income |
For checking, savings, and credit card accounts, use the model.
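The routing in the table above could be sketched as a simple lookup that runs before the model. The function name and the lowercased account-type keys are illustrative assumptions; your account metadata may label these types differently.

```python
from typing import Optional

# Mapping copied from the table above; keys are lowercased for lookup.
ACCOUNT_TYPE_CATEGORY = {
    "mortgage": "Mortgage",
    "auto loan": "Transportation",
    "student loan": "Education",
    "personal loan": "Transfer",
    "heloc": "Transfer",
    "cd": "Income",
}


def category_from_account_type(account_type: str) -> Optional[str]:
    """Return the implied category, or None when the model should decide."""
    return ACCOUNT_TYPE_CATEGORY.get(account_type.strip().lower())


print(category_from_account_type("HELOC"))     # Transfer
print(category_from_account_type("Checking"))  # None
```

A `None` result means the transaction belongs to a checking, savings, or credit card account and should be passed to the classifier.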
## Training
```
Model: DistilBERT-base-uncased + LoRA (r=32, alpha=64)
Dataset: 68,000 synthetic samples (4,000 per category)
Trainable: 1.8M / 68.7M parameters (2.6%)
Training: 20 epochs, best at epoch 16
Validation: 99.9% accuracy (15 of 17 categories at 100%)
```
### Multi-Format Training
The model is trained on 8 bank statement formats so that classification does not depend on which of these layouts produced the description:
| Format | Example | Source |
|---|---|---|
| Chase merchant | `STARBUCKS #1234` | Chase credit cards |
| Chase ACH | `INSTITUTION PURPOSE PPD ID: CODE` | Chase checking |
| Apple Card | `MERCHANT ADDRESS CITY ZIP STATE USA` | Apple Card |
| PayPal native | `PreApproved Payment Bill User Payment: MERCHANT` | PayPal credit card |
| PayPal prefix | `PP*MERCHANT`, `PYPL*MERCHANT`, `PAYPAL *MERCHANT` | Chase/other banks |
| Capital One | `Withdrawal from MERCHANT`, `Preauthorized Deposit from MERCHANT` | Capital One |
| Mercury | `MERCHANT; Description` or just `MERCHANT` | Mercury, neobanks |
| POS prefix | `SQ *MERCHANT`, `TST*MERCHANT`, `CLV*MERCHANT` | Square, Toast, Clover |
PayPal formats appear across all spending categories at meaningful rates, reflecting that people use PayPal cards at any merchant.
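If you maintain merchant rules alongside the model, it can help to separate the processor prefix from the merchant name. The sketch below recognizes only the PayPal and POS prefixes listed in the table; the function name and the `paypal`/`pos` labels are illustrative. The model was trained on descriptions that include these prefixes, so this helper is meant for rule lookup, not for preprocessing model input.

```python
import re

# Processor prefixes taken from the format table; label names are illustrative.
PREFIX_PATTERNS = [
    ("paypal", re.compile(r"^(?:PP\*|PYPL\*|PAYPAL \*)\s*", re.IGNORECASE)),
    ("pos", re.compile(r"^(?:SQ \*|TST\*|CLV\*)\s*", re.IGNORECASE)),
]


def split_processor_prefix(description: str):
    """Return (processor, merchant); processor is None if no prefix matches."""
    text = description.strip()
    for name, pattern in PREFIX_PATTERNS:
        match = pattern.match(text)
        if match:
            # Everything after the matched prefix is treated as the merchant.
            return name, text[match.end():].strip()
    return None, text


print(split_processor_prefix("PP*SAFEWAY"))       # ('paypal', 'SAFEWAY')
print(split_processor_prefix("SQ *BLUE BOTTLE"))  # ('pos', 'BLUE BOTTLE')
print(split_processor_prefix("STARBUCKS #1234"))  # (None, 'STARBUCKS #1234')
```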
### Honest Assessment
The 99.9% validation accuracy is on synthetic data. On ~2,000 real transactions:
- **96.1% of model classifications at 0.90+ confidence**
- **< 0.5% below 0.50 confidence**
- 17 bank-category fallbacks (obscure merchants where the model defers)
- Shopping is the weakest category due to overlap with Subscription and Groceries
- Niche/unknown merchants may classify with lower confidence; use merchant rules for known edge cases
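One way to act on these confidence figures is to accept high-confidence predictions, let a merchant rule override the model when one exists, and send the rest to review. The sketch below is an assumption about how you might wire this up, not the project's own logic: `resolve_category` is a hypothetical name, and the 0.90 default simply mirrors the band reported above rather than a tuned threshold.

```python
from typing import Any, Dict, Optional


def resolve_category(
    prediction: Dict[str, Any],
    merchant_rule: Optional[str] = None,
    min_confidence: float = 0.90,
) -> Dict[str, Any]:
    """Choose between a merchant rule, the model label, and manual review.

    `prediction` has the shape the text-classification pipeline returns
    for one input: {"label": str, "score": float}.
    """
    if merchant_rule is not None:
        # A known merchant rule takes priority over the model.
        return {"category": merchant_rule, "source": "rule"}
    if float(prediction["score"]) >= min_confidence:
        return {"category": prediction["label"], "source": "model"}
    # Low confidence and no rule: leave uncategorized for review.
    return {"category": None, "source": "review"}


print(resolve_category({"label": "Restaurants", "score": 0.99}))
# {'category': 'Restaurants', 'source': 'model'}
print(resolve_category({"label": "Shopping", "score": 0.41}))
# {'category': None, 'source': 'review'}
```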
## Usage
### Python
```python
from transformers import pipeline
classifier = pipeline("text-classification", model="DoDataThings/distilbert-us-transaction-classifier-v2")
# Sign prefix required
result = classifier("[debit] STARBUCKS #1234 SAN FRANCISCO CA")
print(result) # [{'label': 'Restaurants', 'score': 0.99}]
# Sign matters for ambiguous transactions
classifier("[credit] VENMO CASHOUT PPD ID: 12345678")
# [{'label': 'Income', 'score': 0.95}]
classifier("[debit] VENMO PAYMENT TO JOHN SMITH")
# [{'label': 'Transfer', 'score': 0.97}]
# Works across all bank formats
classifier("[debit] PreApproved Payment Bill User Payment: Netflix")
# [{'label': 'Subscription', 'score': 1.00}]
classifier("[debit] PP*SAFEWAY")
# [{'label': 'Groceries', 'score': 1.00}]
```
### JavaScript (Transformers.js)
```javascript
// Top-level `await` requires an ES module, so use `import` rather than `require`.
import { pipeline } from '@xenova/transformers';

const classifier = await pipeline(
  'text-classification',
  'DoDataThings/distilbert-us-transaction-classifier-v2'
);

// Sign prefix required, as in the Python example.
const result = await classifier('[debit] STARBUCKS #1234');
// [{ label: 'Restaurants', score: 0.99 }]
```
An ONNX export is included in the `onnx/` subdirectory.
## Design Decisions
- **Sign prefix, not account type.** We considered passing account type (checking, credit, etc.) as a feature but concluded that sign alone provides the disambiguation signal. Account type is an upstream routing concern: it determines which classifier runs, not what the classifier outputs.
- **17 model categories + 6 account-type categories.** Mortgage is both a model category (for classifying mortgage descriptions on checking accounts) and an account-type-implied category (for mortgage account transactions). This serves both use cases: people with account type metadata and people with just transaction descriptions.
- **PayPal as a bank format, not a wrapper.** PayPal is a card issuer. People use PayPal cards at restaurants, grocery stores, and everywhere else. The training data treats PayPal formats as first-class bank statement structures across all categories.
- **Synthetic data with real formats.** The training data is synthetic but models real bank statement patterns: Chase ACH padding, Apple Card address formats, Capital One action prefixes, Mercury's minimal format. The generator is open source so you can extend it.
## Training Data
The dataset is published at [`DoDataThings/us-bank-transaction-categories-v2`](https://huggingface.co/datasets/DoDataThings/us-bank-transaction-categories-v2).
## Generator
The synthetic data generator is open source:
```bash
node scripts/generate-training-data.js --count 4000 # 4,000 per category
```
Available at [github.com/wnstnb/foliome](https://github.com/wnstnb/foliome).
## Limitations
- **US bank formats only**: trained on Chase, Apple Card, PayPal, Capital One, Mercury, and US Bank patterns
- **Synthetic training data**: may miss patterns from banks not represented
- **Shopping is the weakest category** due to overlap with Subscription and Groceries
- **Sign prefix required**: passing raw descriptions without `[debit]`/`[credit]` will degrade accuracy
- **Not a standalone solution**: best results come from combining with merchant rules and account-type classification
## License
Apache 2.0