license: apache-2.0
tags:
  - text-classification
  - transformers
  - onnx
  - safetensors
  - transformers.js
  - distilbert
  - finance
  - transactions
  - english
language:
  - en
datasets:
  - DoDataThings/us-bank-transaction-categories-v2
pipeline_tag: text-classification

DistilBERT US Bank Transaction Classifier v2

A fine-tuned DistilBERT model that classifies US bank transaction descriptions into 17 spending categories. Each input carries a [debit] or [credit] sign prefix that tells the model which way the money moved, because the description alone can be ambiguous: a payroll deposit and a Venmo payment look similar in text but mean opposite things financially.

Successor to v1, which classified on description text alone. v2 adds sign-aware input, expanded merchant coverage (500+), multi-format training across 8 bank statement structures, and PayPal as a first-class format.

How It Works

The model takes a sign prefix + transaction description and outputs one of 17 categories:

Input:  "[debit] STARBUCKS #1234 SAN FRANCISCO CA"
Output: Restaurants (0.99)

Input:  "[credit] ACME CORP       PAYROLL                    PPD ID: 123456789"
Output: Income (1.00)

Input:  "[debit] CHASE CREDIT CRD AUTOPAY                    PPD ID: 9876543210"
Output: Transfer (1.00)

Input:  "[debit] PreApproved Payment Bill User Payment: Netflix"
Output: Subscription (1.00)

The sign prefix encodes the transaction direction from the cardholder's perspective:

[debit] — money left the account (purchases, payments out, fees)
[credit] — money entered the account (income, refunds, payments received)

This is critical for distinguishing Income from Transfer. [credit] VENMO CASHOUT is Income (money arriving). [debit] VENMO PAYMENT TO JOHN SMITH is Transfer (money leaving). The description alone can't tell you which.
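
Most bank exports carry a signed amount rather than a debit/credit label. A minimal sketch of building the model input from such a row is shown here. The helper name `build_model_input` and the sign convention (negative means money left the account) are illustrative assumptions, not part of the model; some exports, credit card statements in particular, invert the sign.

```python
def build_model_input(description: str, amount: float) -> str:
    """Prefix a raw description with the sign token the model expects.

    ASSUMPTION: amount < 0 means money left the account. Flip the
    comparison if your export uses the opposite convention.
    """
    prefix = "[debit]" if amount < 0 else "[credit]"
    # Only outer whitespace is trimmed; internal padding is left untouched.
    return f"{prefix} {description.strip()}"


print(build_model_input("STARBUCKS #1234 SAN FRANCISCO CA", -5.75))
# [debit] STARBUCKS #1234 SAN FRANCISCO CA
```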

Categories (17)

Category	What it covers
Restaurants	Fast food, sit-down, coffee, delivery, POS systems (TST*, SQ *, CLV*)
Groceries	Supermarkets, warehouse clubs, farmers markets, convenience stores
Shopping	Retail, online, department stores, pet stores, liquor stores, e-commerce marketplaces
Transportation	Gas, EV charging, rideshare, auto service, parking, tolls, DMV
Entertainment	Movies, events, gaming, gambling/sportsbooks
Utilities	Electric, internet, phone, water, waste/trash, solar
Subscription	Streaming, SaaS, AI tools, VPNs, social media premium, dating, business SaaS
Healthcare	Pharmacy, doctor, dentist, telehealth, vision, hospital
Insurance	Auto, home, health, life, home warranty
Mortgage	Bank, credit union, and fintech mortgage payments, escrow, principal
Rent	Property management companies, lease payments
Travel	Hotels, airlines, car rental, cruise lines, airport services
Education	Online courses, tutoring, books, tuition, certification
Personal Care	Salon, gym, beauty, spa, barber
Transfer	CC autopay, P2P sends, bank transfers, brokerage sweeps, fintech, BNPL, wire, ATM, cashier's checks
Income	Payroll, direct deposit, interest, refunds, government benefits, gig economy payouts
Fees	Bank fees, late fees, ATM surcharges, service charges

Account-Type-Implied Categories

If you know the account type, some categories can be assigned without the model:

Account Type	Category
Mortgage	Mortgage
Auto Loan	Transportation
Student Loan	Education
Personal Loan	Transfer
HELOC	Transfer
CD	Income

For checking, savings, and credit card accounts, use the model.
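
The routing described above can be sketched as a small lookup that runs before the model. The function name `categorize` and the `classify` callback are hypothetical; the mapping itself is copied from the table.

```python
from typing import Callable, Optional

# Mapping taken from the account-type table; keys are lowercased for lookup.
ACCOUNT_TYPE_CATEGORY = {
    "mortgage": "Mortgage",
    "auto loan": "Transportation",
    "student loan": "Education",
    "personal loan": "Transfer",
    "heloc": "Transfer",
    "cd": "Income",
}


def categorize(
    account_type: Optional[str],
    model_input: str,
    classify: Callable[[str], str],
) -> str:
    """Return the implied category when the account type decides it,
    otherwise defer to the classifier (checking, savings, credit card)."""
    key = (account_type or "").strip().lower()
    if key in ACCOUNT_TYPE_CATEGORY:
        return ACCOUNT_TYPE_CATEGORY[key]
    return classify(model_input)
```

In practice `classify` would wrap the pipeline, for example `lambda text: classifier(text)[0]["label"]`.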

Training

Model:       DistilBERT-base-uncased + LoRA (r=32, alpha=64)
Dataset:     68,000 synthetic samples (4,000 per category)
Trainable:   1.8M / 68.7M parameters (2.6%)
Training:    20 epochs, best at epoch 16
Validation:  99.9% accuracy (15 of 17 categories at 100%)
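
The trainable-parameter figure can be sanity-checked with a little arithmetic. The sketch below assumes, without confirmation from the source, that LoRA adapts all four attention projections in every layer and that the classification head is trained in full. The 6 layers and hidden size of 768 are properties of DistilBERT-base.

```python
# DistilBERT-base dimensions: 6 transformer layers, hidden size 768.
HIDDEN = 768
LAYERS = 6
NUM_LABELS = 17
RANK = 32  # LoRA r from the training summary

# ASSUMPTION: LoRA adapts all four attention projections per layer
# (query, key, value, output). The source does not state the targets.
ADAPTED_PER_LAYER = 4


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter adds two matrices: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


adapter_total = LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)

# ASSUMPTION: the classification head is trained in full.
# pre_classifier: 768 -> 768, classifier: 768 -> 17 (weights plus biases).
head_total = (HIDDEN * HIDDEN + HIDDEN) + (HIDDEN * NUM_LABELS + NUM_LABELS)

trainable = adapter_total + head_total
print(adapter_total, head_total, trainable)
# 1179648 603665 1783313
```

Under those assumptions the total is about 1.78M, in line with the 1.8M reported above, though other configurations could land near the same number.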

Multi-Format Training

The model is trained on 8 bank statement formats so it classifies correctly regardless of which bank produced the description:

Format	Example	Source
Chase merchant	`STARBUCKS #1234`	Chase credit cards
Chase ACH	`INSTITUTION PURPOSE PPD ID: CODE`	Chase checking
Apple Card	`MERCHANT ADDRESS CITY ZIP STATE USA`	Apple Card
PayPal native	`PreApproved Payment Bill User Payment: MERCHANT`	PayPal credit card
PayPal prefix	`PP*MERCHANT`, `PYPL*MERCHANT`, `PAYPAL *MERCHANT`	Chase/other banks
Capital One	`Withdrawal from MERCHANT`, `Preauthorized Deposit from MERCHANT`	Capital One
Mercury	`MERCHANT; Description` or just `MERCHANT`	Mercury, neobanks
POS prefix	`SQ *MERCHANT`, `TST*MERCHANT`, `CLV*MERCHANT`	Square, Toast, Clover

PayPal formats appear across all spending categories at meaningful rates, reflecting that people use PayPal cards at any merchant.
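
One way to spot-check format robustness is to render the same merchant in several of the formats above and confirm the predictions agree. This is a sketch only: it is not the project's generator (which is a Node script), and `FORMAT_TEMPLATES` and `render_variants` are hypothetical names. The templates are copied from the table.

```python
from typing import List

# Templates copied from the format table; {merchant} replaces MERCHANT.
FORMAT_TEMPLATES = {
    "paypal_native": "PreApproved Payment Bill User Payment: {merchant}",
    "paypal_prefix": "PAYPAL *{merchant}",
    "capital_one": "Withdrawal from {merchant}",
    "pos_clover": "CLV*{merchant}",
}


def render_variants(merchant: str, sign: str = "debit") -> List[str]:
    """Render one merchant in every template, with the sign prefix added."""
    return [
        f"[{sign}] {template.format(merchant=merchant)}"
        for template in FORMAT_TEMPLATES.values()
    ]


for line in render_variants("NETFLIX"):
    print(line)
```

Passing the returned list to the classifier should, ideally, yield the same label for every variant.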

Honest Assessment

The 99.9% validation accuracy is on synthetic data. On ~2,000 real transactions:

96.1% of model classifications scored 0.90 or higher confidence
Fewer than 0.5% scored below 0.50 confidence
17 bank-category fallbacks (obscure merchants where the model defers)
Shopping is the weakest category due to overlap with Subscription and Groceries
Niche/unknown merchants may classify with lower confidence — use merchant rules for known edge cases
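
A minimal sketch of deferring to the bank's own category when the model is unsure. The source does not say how the fallback is triggered, so the score cutoff, the `resolve_category` name, and the `bank_category` argument are all assumptions.

```python
from typing import Optional, Tuple


def resolve_category(
    prediction: dict,
    bank_category: Optional[str] = None,
    min_score: float = 0.50,  # hypothetical cutoff; tune on your own data
) -> Tuple[str, str]:
    """Return (category, source) for one pipeline prediction.

    `prediction` is a dict like {"label": "Shopping", "score": 0.42},
    the shape of each item returned by the text-classification pipeline.
    """
    if prediction["score"] >= min_score:
        return prediction["label"], "model"
    if bank_category:
        return bank_category, "bank"
    # Nothing better is available, so keep the label but flag it.
    return prediction["label"], "model-low-confidence"
```

Returning the source alongside the category makes it easy to count how often the fallback fires on your own transactions.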

Usage

Python

from transformers import pipeline

classifier = pipeline("text-classification", model="DoDataThings/distilbert-us-transaction-classifier-v2")

# Sign prefix required
result = classifier("[debit] STARBUCKS #1234 SAN FRANCISCO CA")
print(result)  # [{'label': 'Restaurants', 'score': 0.99}]

# Sign matters for ambiguous transactions
classifier("[credit] VENMO CASHOUT PPD ID: 12345678")
# [{'label': 'Income', 'score': 0.95}]

classifier("[debit] VENMO PAYMENT TO JOHN SMITH")
# [{'label': 'Transfer', 'score': 0.97}]

# Works across all bank formats
classifier("[debit] PreApproved Payment Bill User Payment: Netflix")
# [{'label': 'Subscription', 'score': 1.00}]

classifier("[debit] PP*SAFEWAY")
# [{'label': 'Groceries', 'score': 1.00}]

JavaScript (Transformers.js)

// Use ES module syntax: top-level await is not available in CommonJS (require)
import { pipeline } from '@xenova/transformers';

const classifier = await pipeline(
  'text-classification',
  'DoDataThings/distilbert-us-transaction-classifier-v2'
);

const result = await classifier('[debit] STARBUCKS #1234');
// [{ label: 'Restaurants', score: 0.99 }]

An ONNX export is included in the onnx/ subdirectory.

Design Decisions

Sign prefix, not account type. We considered passing account type (checking, credit, etc.) as a feature but concluded that sign alone provides the disambiguation signal. Account type is an upstream routing concern — it determines which classifier runs, not what the classifier outputs.
17 model categories + 6 account-type categories. Mortgage is both a model category (for classifying mortgage descriptions on checking accounts) and an account-type-implied category (for mortgage account transactions). This serves both use cases — people with account type metadata and people with just transaction descriptions.
PayPal as a bank format, not a wrapper. PayPal is a card issuer. People use PayPal cards at restaurants, grocery stores, and everywhere else. The training data treats PayPal formats as first-class bank statement structures across all categories.
Synthetic data with real formats. The training data is synthetic but models real bank statement patterns — Chase ACH padding, Apple Card address formats, Capital One action prefixes, Mercury's minimal format. The generator is open source so you can extend it.

Training Data

The dataset is published at DoDataThings/us-bank-transaction-categories-v2.

Generator

The synthetic data generator is open source:

node scripts/generate-training-data.js --count 4000  # 4,000 per category

Available at github.com/wnstnb/foliome.

Limitations

US bank formats only — Trained on Chase, Apple Card, PayPal, Capital One, Mercury, and US Bank patterns
Synthetic training data — May miss patterns from banks not represented
Shopping is the weakest category due to overlap with Subscription and Groceries
Sign prefix required — Passing raw descriptions without [debit]/[credit] will degrade accuracy
Not a standalone solution — Best results come from combining with merchant rules and account-type classification
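
Merchant rules can sit in front of the model as a simple override. The sketch below uses substring matching; the rule entries and function names are hypothetical placeholders, not rules shipped with the model.

```python
from typing import Callable, Optional

# Hypothetical rules: uppercase substring -> category. Replace with your own.
MERCHANT_RULES = {
    "COSTCO GAS": "Transportation",
    "AMAZON PRIME": "Subscription",
}


def match_rule(description: str) -> Optional[str]:
    """Return the category of the first rule found in the description."""
    text = description.upper()
    for needle, category in MERCHANT_RULES.items():
        if needle in text:
            return category
    return None


def classify_with_rules(model_input: str, classify: Callable[[str], str]) -> str:
    """Merchant rules win; the model handles everything else."""
    return match_rule(model_input) or classify(model_input)
```

Rules are checked in insertion order, so list more specific substrings before general ones.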

License

Apache 2.0