---
license: apache-2.0
tags:
  - text-classification
  - transformers
  - onnx
  - safetensors
  - transformers.js
  - distilbert
  - finance
  - transactions
  - english
language:
  - en
datasets:
  - DoDataThings/us-bank-transaction-categories-v2
pipeline_tag: text-classification
---

DistilBERT US Bank Transaction Classifier v2

A fine-tuned DistilBERT model that classifies US bank transaction descriptions into 17 spending categories. Uses a [debit]/[credit] sign prefix to disambiguate transaction direction: a payroll deposit and a Venmo payment look similar in text but mean opposite things financially.

Successor to v1, which classified on description text alone. v2 adds sign-aware input, expanded merchant coverage (500+), multi-format training across 8 bank statement structures, and PayPal as a first-class format.

How It Works

The model takes a sign prefix + transaction description and outputs one of 17 categories:

Input:  "[debit] STARBUCKS #1234 SAN FRANCISCO CA"
Output: Restaurants (0.99)

Input:  "[credit] ACME CORP       PAYROLL                    PPD ID: 123456789"
Output: Income (1.00)

Input:  "[debit] CHASE CREDIT CRD AUTOPAY                    PPD ID: 9876543210"
Output: Transfer (1.00)

Input:  "[debit] PreApproved Payment Bill User Payment: Netflix"
Output: Subscription (1.00)

The sign prefix encodes the transaction direction from the cardholder's perspective:

  • [debit]: money left the account (purchases, payments out, fees)
  • [credit]: money entered the account (income, refunds, payments received)

This is critical for distinguishing Income from Transfer. [credit] VENMO CASHOUT is Income (money arriving). [debit] VENMO PAYMENT TO JOHN SMITH is Transfer (money leaving). The description alone can't tell you which.
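
A minimal sketch of building the model input from a raw ledger row. It assumes your data stores a signed amount where a negative value means money leaving the account; the helper name `build_model_input` and that sign convention are illustrative assumptions, not part of the model's tooling. Flip the comparison if your export uses the opposite convention.

```python
def build_model_input(description: str, amount: float) -> str:
    """Prepend the sign prefix the model expects.

    Assumption: negative amounts are outflows (debits) and
    non-negative amounts are inflows (credits).
    """
    prefix = "[debit]" if amount < 0 else "[credit]"
    # Only leading/trailing whitespace is trimmed; internal padding is kept.
    return f"{prefix} {description.strip()}"


print(build_model_input("VENMO PAYMENT TO JOHN SMITH", -25.00))
print(build_model_input("VENMO CASHOUT", 140.00))
```

The returned string can be passed directly to the classifier.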

Categories (17)

| Category | What it covers |
|---|---|
| Restaurants | Fast food, sit-down, coffee, delivery, POS systems (`TST*`, `SQ*`, `CLV*`) |
| Groceries | Supermarkets, warehouse clubs, farmers markets, convenience stores |
| Shopping | Retail, online, department stores, pet stores, liquor stores, e-commerce marketplaces |
| Transportation | Gas, EV charging, rideshare, auto service, parking, tolls, DMV |
| Entertainment | Movies, events, gaming, gambling/sportsbooks |
| Utilities | Electric, internet, phone, water, waste/trash, solar |
| Subscription | Streaming, SaaS, AI tools, VPNs, social media premium, dating, business SaaS |
| Healthcare | Pharmacy, doctor, dentist, telehealth, vision, hospital |
| Insurance | Auto, home, health, life, home warranty |
| Mortgage | Bank, credit union, and fintech mortgage payments, escrow, principal |
| Rent | Property management companies, lease payments |
| Travel | Hotels, airlines, car rental, cruise lines, airport services |
| Education | Online courses, tutoring, books, tuition, certification |
| Personal Care | Salon, gym, beauty, spa, barber |
| Transfer | CC autopay, P2P sends, bank transfers, brokerage sweeps, fintech, BNPL, wire, ATM, cashier's checks |
| Income | Payroll, direct deposit, interest, refunds, government benefits, gig economy payouts |
| Fees | Bank fees, late fees, ATM surcharges, service charges |

Account-Type-Implied Categories

If you know the account type, some categories can be assigned without the model:

| Account Type | Category |
|---|---|
| Mortgage | Mortgage |
| Auto Loan | Transportation |
| Student Loan | Education |
| Personal Loan | Transfer |
| HELOC | Transfer |
| CD | Income |

For checking, savings, and credit card accounts, use the model.
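
The routing described above can be sketched as a lookup that runs before the model. The dictionary mirrors the account-type table; the function name `category_from_account_type` and the lowercase key normalization are assumptions made for illustration.

```python
from typing import Optional

# Mapping copied from the account-type table; keys are lowercased for lookup.
ACCOUNT_TYPE_CATEGORY = {
    "mortgage": "Mortgage",
    "auto loan": "Transportation",
    "student loan": "Education",
    "personal loan": "Transfer",
    "heloc": "Transfer",
    "cd": "Income",
}


def category_from_account_type(account_type: str) -> Optional[str]:
    """Return the implied category, or None if the model should decide."""
    return ACCOUNT_TYPE_CATEGORY.get(account_type.strip().lower())


print(category_from_account_type("HELOC"))      # Transfer
print(category_from_account_type("Checking"))   # None
```

A `None` result means the account type implies nothing, so the transaction goes to the classifier.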

Training

Model:       DistilBERT-base-uncased + LoRA (r=32, alpha=64)
Dataset:     68,000 synthetic samples (4,000 per category)
Trainable:   1.8M / 68.7M parameters (2.6%)
Training:    20 epochs, best at epoch 16
Validation:  99.9% accuracy (15 of 17 categories at 100%)
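
As a rough sanity check on the trainable-parameter figure, the arithmetic below counts LoRA adapter weights for DistilBERT-base (hidden size 768, 6 layers) at r=32. Which modules were adapted is not stated in this card, so the sketch assumes all four attention projections per layer plus a fully trained classification head. This is one configuration consistent with the reported number, not a confirmed description of the training setup.

```python
HIDDEN = 768      # DistilBERT-base hidden size
LAYERS = 6        # DistilBERT-base transformer layers
RANK = 32         # LoRA r from the training summary
NUM_LABELS = 17   # output categories

# LoRA adds two low-rank matrices (HIDDEN x RANK and RANK x HIDDEN)
# for each adapted square weight matrix.
lora_per_matrix = 2 * HIDDEN * RANK

# Assumption: all four attention projections in every layer are adapted.
adapted_matrices = 4 * LAYERS
lora_params = lora_per_matrix * adapted_matrices

# Assumption: the sequence-classification head is trained in full
# (a HIDDEN x HIDDEN pre-classifier plus a HIDDEN x NUM_LABELS classifier).
pre_classifier = HIDDEN * HIDDEN + HIDDEN
classifier = HIDDEN * NUM_LABELS + NUM_LABELS
head_params = pre_classifier + classifier

trainable = lora_params + head_params
print(f"LoRA: {lora_params:,}  head: {head_params:,}  total: {trainable:,}")
```

Under these assumptions the total is 1,783,313, which rounds to the 1.8M listed above.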

Multi-Format Training

The model is trained on 8 bank statement formats so it classifies correctly regardless of which bank produced the description:

| Format | Example | Source |
|---|---|---|
| Chase merchant | `STARBUCKS #1234` | Chase credit cards |
| Chase ACH | `INSTITUTION PURPOSE PPD ID: CODE` | Chase checking |
| Apple Card | `MERCHANT ADDRESS CITY ZIP STATE USA` | Apple Card |
| PayPal native | `PreApproved Payment Bill User Payment: MERCHANT` | PayPal credit card |
| PayPal prefix | `PP*MERCHANT`, `PYPL*MERCHANT`, `PAYPAL *MERCHANT` | Chase/other banks |
| Capital One | `Withdrawal from MERCHANT`, `Preauthorized Deposit from MERCHANT` | Capital One |
| Mercury | `MERCHANT; Description` or just `MERCHANT` | Mercury, neobanks |
| POS prefix | `SQ *MERCHANT`, `TST*MERCHANT`, `CLV*MERCHANT` | Square, Toast, Clover |

PayPal formats appear across all spending categories at meaningful rates, reflecting that people use PayPal cards at any merchant.
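
If you want to break down your own evaluation results by format, a best-effort detector can be sketched from the patterns listed above. The function `detect_format`, the pattern order, and the `"other"` bucket are all assumptions for illustration; the model itself does not need this step. It expects the raw description without the sign prefix.

```python
import re

# Patterns derived from the format examples; the first match wins.
FORMAT_PATTERNS = [
    ("PayPal native", re.compile(r"^PreApproved Payment Bill User Payment:")),
    ("PayPal prefix", re.compile(r"^(PP\*|PYPL\*|PAYPAL \*)")),
    ("POS prefix", re.compile(r"^(SQ \*|TST\*|CLV\*)")),
    ("Capital One", re.compile(r"^(Withdrawal from|Preauthorized Deposit from) ")),
    ("Chase ACH", re.compile(r"PPD ID:")),
]


def detect_format(description: str) -> str:
    """Guess the statement format of a raw description (no sign prefix)."""
    for name, pattern in FORMAT_PATTERNS:
        if pattern.search(description):
            return name
    # Plain merchant strings (Chase merchant, Apple Card, Mercury) are not
    # distinguishable by prefix, so they share one bucket here.
    return "other"


print(detect_format("PP*SAFEWAY"))
print(detect_format("ACME CORP PAYROLL PPD ID: 123456789"))
```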

Honest Assessment

The 99.9% validation accuracy is on synthetic data. On ~2,000 real transactions:

  • 96.1% of model classifications at 0.90+ confidence
  • < 0.5% below 0.50 confidence
  • 17 bank-category fallbacks (obscure merchants where the model defers)
  • Shopping is the weakest category due to overlap with Subscription and Groceries
  • Niche/unknown merchants may classify with lower confidence; use merchant rules for known edge cases
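
One way to act on these numbers is to accept only confident predictions and defer otherwise. The sketch below is an assumption about how such a policy could look: the function name `resolve_category`, the precedence order, and the 0.90 threshold are illustrative, and the card does not specify how its fallbacks were triggered.

```python
from typing import Optional


def resolve_category(
    prediction: dict,
    merchant_rule: Optional[str] = None,
    bank_category: Optional[str] = None,
    threshold: float = 0.90,
) -> str:
    """Pick a final category from a model prediction plus optional overrides.

    prediction is one item of the pipeline output, for example
    {"label": "Shopping", "score": 0.62}. The threshold is an assumption.
    """
    if merchant_rule is not None:
        return merchant_rule           # explicit rule for a known edge case wins
    if prediction["score"] >= threshold:
        return prediction["label"]     # confident model prediction
    if bank_category is not None:
        return bank_category           # defer to the bank's own category
    return prediction["label"]         # nothing better available


print(resolve_category({"label": "Shopping", "score": 0.62}, bank_category="Groceries"))
```

Tune the threshold against your own transactions; a lower value trusts the model more often, a higher one sends more rows to the fallback.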

Usage

Python

from transformers import pipeline

classifier = pipeline("text-classification", model="DoDataThings/distilbert-us-transaction-classifier-v2")

# Sign prefix required
result = classifier("[debit] STARBUCKS #1234 SAN FRANCISCO CA")
print(result)  # [{'label': 'Restaurants', 'score': 0.99}]

# Sign matters for ambiguous transactions
classifier("[credit] VENMO CASHOUT PPD ID: 12345678")
# [{'label': 'Income', 'score': 0.95}]

classifier("[debit] VENMO PAYMENT TO JOHN SMITH")
# [{'label': 'Transfer', 'score': 0.97}]

# Works across all bank formats
classifier("[debit] PreApproved Payment Bill User Payment: Netflix")
# [{'label': 'Subscription', 'score': 1.00}]

classifier("[debit] PP*SAFEWAY")
# [{'label': 'Groceries', 'score': 1.00}]

JavaScript (Transformers.js)

// Run as an ES module (for example a .mjs file) so top-level await works.
import { pipeline } from '@xenova/transformers';

const classifier = await pipeline(
  'text-classification',
  'DoDataThings/distilbert-us-transaction-classifier-v2'
);

const result = await classifier('[debit] STARBUCKS #1234');
// [{ label: 'Restaurants', score: 0.99 }]

An ONNX export is included in the onnx/ subdirectory.

Design Decisions

  • Sign prefix, not account type. We considered passing account type (checking, credit, etc.) as a feature but concluded that sign alone provides the disambiguation signal. Account type is an upstream routing concern: it determines which classifier runs, not what the classifier outputs.
  • 17 model categories + 6 account-type categories. Mortgage is both a model category (for classifying mortgage descriptions on checking accounts) and an account-type-implied category (for mortgage account transactions). This serves both use cases: people with account type metadata and people with only transaction descriptions.
  • PayPal as a bank format, not a wrapper. PayPal is a card issuer. People use PayPal cards at restaurants, grocery stores, and everywhere else. The training data treats PayPal formats as first-class bank statement structures across all categories.
  • Synthetic data with real formats. The training data is synthetic but models real bank statement patterns: Chase ACH padding, Apple Card address formats, Capital One action prefixes, Mercury's minimal format. The generator is open source so you can extend it.

Training Data

The dataset is published at DoDataThings/us-bank-transaction-categories-v2.

Generator

The synthetic data generator is open source:

node scripts/generate-training-data.js --count 4000  # 4,000 per category

Available at github.com/wnstnb/foliome.

Limitations

  • US bank formats only: trained on Chase, Apple Card, PayPal, Capital One, Mercury, and US Bank patterns
  • Synthetic training data: may miss patterns from banks not represented
  • Shopping is the weakest category due to overlap with Subscription and Groceries
  • Sign prefix required: passing raw descriptions without [debit]/[credit] will degrade accuracy
  • Not a standalone solution: best results come from combining with merchant rules and account-type classification

License

Apache 2.0