---
license: apache-2.0
tags:
  - text-classification
  - transformers
  - onnx
  - safetensors
  - transformers.js
  - distilbert
  - finance
  - transactions
  - english
language:
  - en
datasets:
  - DoDataThings/us-bank-transaction-categories-v2
pipeline_tag: text-classification
---

# DistilBERT US Bank Transaction Classifier v2

A fine-tuned DistilBERT model that classifies US bank transaction descriptions into 17 spending categories. Uses a `[debit]`/`[credit]` sign prefix to disambiguate transaction direction: a payroll deposit and a Venmo payment look similar in text but mean opposite things financially.

**Successor to [v1](https://huggingface.co/DoDataThings/distilbert-us-transaction-classifier)**, which classified on description text alone. v2 adds sign-aware input, expanded merchant coverage (500+), multi-format training across 8 bank statement structures, and PayPal as a first-class format.

## How It Works

The model takes a sign prefix + transaction description and outputs one of 17 categories:

```
Input:  "[debit] STARBUCKS #1234 SAN FRANCISCO CA"
Output: Restaurants (0.99)

Input:  "[credit] ACME CORP       PAYROLL                    PPD ID: 123456789"
Output: Income (1.00)

Input:  "[debit] CHASE CREDIT CRD AUTOPAY                    PPD ID: 9876543210"
Output: Transfer (1.00)

Input:  "[debit] PreApproved Payment Bill User Payment: Netflix"
Output: Subscription (1.00)
```

The sign prefix encodes the transaction direction from the cardholder's perspective:
- `[debit]`: money left the account (purchases, payments out, fees)
- `[credit]`: money entered the account (income, refunds, payments received)

This is critical for distinguishing Income from Transfer. `[credit] VENMO CASHOUT` is Income (money arriving). `[debit] VENMO PAYMENT TO JOHN SMITH` is Transfer (money leaving). The description alone can't tell you which.
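
If your data carries signed amounts rather than an explicit debit/credit flag, the prefix can be derived before classification. A minimal sketch, assuming negative amounts mean money leaving the account (some bank exports flip this convention, so check yours); `add_sign_prefix` is a hypothetical helper, not part of the model's API:

```python
def add_sign_prefix(description: str, amount: float) -> str:
    """Build model input from a raw description and a signed amount.

    Assumption: a negative amount means money left the account ([debit]);
    zero or positive means money entered the account ([credit]).
    """
    prefix = "[debit]" if amount < 0 else "[credit]"
    # Only leading/trailing whitespace is trimmed; internal padding is kept.
    return f"{prefix} {description.strip()}"


print(add_sign_prefix("VENMO PAYMENT TO JOHN SMITH", -25.00))
print(add_sign_prefix("VENMO CASHOUT", 120.00))
```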

## Categories (17)

| Category | What it covers |
|----------|----------------|
| Restaurants | Fast food, sit-down, coffee, delivery, POS systems (TST*, SQ*, CLV*) |
| Groceries | Supermarkets, warehouse clubs, farmers markets, convenience stores |
| Shopping | Retail, online, department stores, pet stores, liquor stores, e-commerce marketplaces |
| Transportation | Gas, EV charging, rideshare, auto service, parking, tolls, DMV |
| Entertainment | Movies, events, gaming, gambling/sportsbooks |
| Utilities | Electric, internet, phone, water, waste/trash, solar |
| Subscription | Streaming, SaaS, AI tools, VPNs, social media premium, dating, business SaaS |
| Healthcare | Pharmacy, doctor, dentist, telehealth, vision, hospital |
| Insurance | Auto, home, health, life, home warranty |
| Mortgage | Bank, credit union, and fintech mortgage payments, escrow, principal |
| Rent | Property management companies, lease payments |
| Travel | Hotels, airlines, car rental, cruise lines, airport services |
| Education | Online courses, tutoring, books, tuition, certification |
| Personal Care | Salon, gym, beauty, spa, barber |
| Transfer | CC autopay, P2P sends, bank transfers, brokerage sweeps, fintech, BNPL, wire, ATM, cashier's checks |
| Income | Payroll, direct deposit, interest, refunds, government benefits, gig economy payouts |
| Fees | Bank fees, late fees, ATM surcharges, service charges |

### Account-Type-Implied Categories

If you know the account type, some categories can be assigned without the model:

| Account Type | Category |
|---|---|
| Mortgage | Mortgage |
| Auto Loan | Transportation |
| Student Loan | Education |
| Personal Loan | Transfer |
| HELOC | Transfer |
| CD | Income |

For checking, savings, and credit card accounts, use the model.
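
The routing described above could look roughly like this. The mapping comes from the table; the function name `categorize`, the lowercase account-type keys, and the `classify` callable are assumptions for illustration:

```python
from typing import Callable, Optional

# Mapping taken from the account-type table; keys are lowercased for lookup.
ACCOUNT_TYPE_CATEGORY = {
    "mortgage": "Mortgage",
    "auto loan": "Transportation",
    "student loan": "Education",
    "personal loan": "Transfer",
    "heloc": "Transfer",
    "cd": "Income",
}


def categorize(
    text: str,
    account_type: Optional[str],
    classify: Callable[[str], str],
) -> str:
    """Return the implied category when the account type decides it, else ask the model."""
    if account_type:
        implied = ACCOUNT_TYPE_CATEGORY.get(account_type.strip().lower())
        if implied is not None:
            return implied
    # Checking, savings, credit card, or unknown account types go to the model.
    return classify(text)
```

Here `classify` can be any function that maps the prefixed description to a label, for example a thin wrapper that returns the top label from the pipeline shown in the Usage section.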

## Training

```
Model:       DistilBERT-base-uncased + LoRA (r=32, alpha=64)
Dataset:     68,000 synthetic samples (4,000 per category)
Trainable:   1.8M / 68.7M parameters (2.6%)
Training:    20 epochs, best at epoch 16
Validation:  99.9% accuracy (15 of 17 categories at 100%)
```

### Multi-Format Training

The model is trained on 8 bank statement formats so it classifies correctly regardless of which bank produced the description:

| Format | Example | Source |
|---|---|---|
| Chase merchant | `STARBUCKS #1234` | Chase credit cards |
| Chase ACH | `INSTITUTION     PURPOSE        PPD ID: CODE` | Chase checking |
| Apple Card | `MERCHANT ADDRESS CITY ZIP STATE USA` | Apple Card |
| PayPal native | `PreApproved Payment Bill User Payment: MERCHANT` | PayPal credit card |
| PayPal prefix | `PP*MERCHANT`, `PYPL*MERCHANT`, `PAYPAL *MERCHANT` | Chase/other banks |
| Capital One | `Withdrawal from MERCHANT`, `Preauthorized Deposit from MERCHANT` | Capital One |
| Mercury | `MERCHANT; Description` or just `MERCHANT` | Mercury, neobanks |
| POS prefix | `SQ *MERCHANT`, `TST*MERCHANT`, `CLV*MERCHANT` | Square, Toast, Clover |

PayPal formats appear across all spending categories at meaningful rates, reflecting that people use PayPal cards at any merchant.
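
The model handles these formats on its own, but tagging the format can help when debugging or when checking which structures dominate your data. A rough sketch built only from the example patterns in the table; `detect_format` and its labels are hypothetical, and real statements will contain variants these checks miss:

```python
import re

# Ordered checks: the first match wins. Patterns mirror the table's examples only.
_FORMAT_PATTERNS = [
    ("paypal_native", re.compile(r"^PreApproved Payment Bill User Payment:")),
    ("paypal_prefix", re.compile(r"^(PP\*|PYPL\*|PAYPAL \*)")),
    ("pos_prefix", re.compile(r"^(SQ \*|TST\*|CLV\*)")),
    ("capital_one", re.compile(r"^(Withdrawal from |Preauthorized Deposit from )")),
    ("chase_ach", re.compile(r"PPD ID:")),
]


def detect_format(description: str) -> str:
    """Guess the statement format of a raw description (without the sign prefix)."""
    for name, pattern in _FORMAT_PATTERNS:
        if pattern.search(description):
            return name
    # Chase merchant, Apple Card, Mercury, or anything unrecognized.
    return "other"
```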

### Honest Assessment

The 99.9% validation accuracy is on synthetic data. On ~2,000 real transactions:

- **96.1% of model classifications at 0.90+ confidence**
- **< 0.5% below 0.50 confidence**
- 17 bank-category fallbacks (obscure merchants where the model defers)
- Shopping is the weakest category due to overlap with Subscription and Groceries
- Niche/unknown merchants may classify with lower confidence; use merchant rules for known edge cases
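
One way to act on these numbers is to accept high-confidence predictions and defer the rest to merchant rules or the bank's own category. A minimal sketch, assuming a 0.90 cutoff (borrowed from the figure above, not a tuned value); `MERCHANT_RULES`, `resolve_category`, and the `bank_category` argument are hypothetical:

```python
from typing import Optional

# Hypothetical substring rules for merchants you already know how to label.
MERCHANT_RULES = {"COSTCO GAS": "Transportation"}


def resolve_category(
    description: str,
    label: str,
    score: float,
    bank_category: Optional[str] = None,
    threshold: float = 0.90,
) -> str:
    """Pick a final category from rules, the model prediction, or a bank fallback."""
    upper = description.upper()
    for needle, category in MERCHANT_RULES.items():
        if needle in upper:
            return category  # explicit rules override the model
    if score >= threshold:
        return label  # confident model prediction
    # Low confidence: defer to the bank's category when one exists,
    # otherwise keep the model's best guess.
    return bank_category or label
```

With the pipeline from the Usage section, `label` and `score` would come from the first element of the classifier's result for the prefixed description.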

## Usage

### Python

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="DoDataThings/distilbert-us-transaction-classifier-v2")

# Sign prefix required
result = classifier("[debit] STARBUCKS #1234 SAN FRANCISCO CA")
print(result)  # [{'label': 'Restaurants', 'score': 0.99}]

# Sign matters for ambiguous transactions
classifier("[credit] VENMO CASHOUT PPD ID: 12345678")
# [{'label': 'Income', 'score': 0.95}]

classifier("[debit] VENMO PAYMENT TO JOHN SMITH")
# [{'label': 'Transfer', 'score': 0.97}]

# Works across all bank formats
classifier("[debit] PreApproved Payment Bill User Payment: Netflix")
# [{'label': 'Subscription', 'score': 1.00}]

classifier("[debit] PP*SAFEWAY")
# [{'label': 'Groceries', 'score': 1.00}]
```

### JavaScript (Transformers.js)

```javascript
// ES module import; the top-level await below needs ESM (e.g. a .mjs file)
import { pipeline } from '@xenova/transformers';

const classifier = await pipeline(
  'text-classification',
  'DoDataThings/distilbert-us-transaction-classifier-v2'
);

const result = await classifier('[debit] STARBUCKS #1234');
// [{ label: 'Restaurants', score: 0.99 }]
```

An ONNX export is included in the `onnx/` subdirectory.

## Design Decisions

- **Sign prefix, not account type.** We considered passing account type (checking, credit, etc.) as a feature but concluded that sign alone provides the disambiguation signal. Account type is an upstream routing concern: it determines which classifier runs, not what the classifier outputs.
- **17 model categories + 6 account-type categories.** Mortgage is both a model category (for classifying mortgage descriptions on checking accounts) and an account-type-implied category (for mortgage account transactions). This serves both use cases: people with account type metadata and people with just transaction descriptions.
- **PayPal as a bank format, not a wrapper.** PayPal is a card issuer. People use PayPal cards at restaurants, grocery stores, and everywhere else. The training data treats PayPal formats as first-class bank statement structures across all categories.
- **Synthetic data with real formats.** The training data is synthetic but models real bank statement patterns: Chase ACH padding, Apple Card address formats, Capital One action prefixes, Mercury's minimal format. The generator is open source so you can extend it.

## Training Data

The dataset is published at [`DoDataThings/us-bank-transaction-categories-v2`](https://huggingface.co/datasets/DoDataThings/us-bank-transaction-categories-v2).

## Generator

The synthetic data generator is open source:

```bash
node scripts/generate-training-data.js --count 4000  # 4,000 per category
```

Available at [github.com/wnstnb/foliome](https://github.com/wnstnb/foliome).

## Limitations

- **US bank formats only**: Trained on Chase, Apple Card, PayPal, Capital One, Mercury, and US Bank patterns
- **Synthetic training data**: May miss patterns from banks not represented
- **Shopping is the weakest category** due to overlap with Subscription and Groceries
- **Sign prefix required**: Passing raw descriptions without `[debit]`/`[credit]` will degrade accuracy
- **Not a standalone solution**: Best results come from combining with merchant rules and account-type classification

## License

Apache 2.0