datasets:
- DoDataThings/us-bank-transaction-categories
pipeline_tag: text-classification
---

# DistilBERT US Bank Transaction Classifier

A fine-tuned DistilBERT model for classifying US bank transaction descriptions into 16 spending categories. Designed as a **fallback layer** in a multi-tier classification pipeline — not a standalone classifier.

## What This Is (and Isn't)

This model was built to handle the long tail of transaction descriptions that merchant rules and bank-provided categories don't cover. It works best as one component in a system:

1. **Merchant rules** — pattern matching catches known merchants (highest accuracy)
2. **Bank-provided categories** — map the bank's own classifications (when available)
3. **This model** — classifies everything else (the fallback)
4. **User overrides** — manual corrections for edge cases

On its own, this model will make mistakes. Bank transaction descriptions are messy, abbreviated, and sometimes genuinely ambiguous — the same store can be "Groceries" or "Shopping" depending on what was purchased, and no classifier can know that from the description alone.
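
The four tiers above can be sketched as a simple fallback chain. This is an illustrative sketch, not code from this repo: the `MERCHANT_RULES` table, the `classify` helper, and the `model_classify` callback are all hypothetical names.

```python
import re

# Tier 1: merchant rules. Hypothetical examples; a real table would be much larger.
MERCHANT_RULES = [
    (re.compile(r"STARBUCKS", re.I), "Restaurants"),
    (re.compile(r"CHEVRON", re.I), "Transportation"),
]

def classify(description, bank_category=None, model_classify=None, user_overrides=None):
    """Walk the tiers in the order listed above."""
    for pattern, category in MERCHANT_RULES:      # 1. merchant rules
        if pattern.search(description):
            result = category
            break
    else:
        # 2. bank-provided category (already mapped to this taxonomy), else
        # 3. the model as the fallback for everything the first two tiers miss
        result = bank_category or (
            model_classify(description) if model_classify else "Uncategorized"
        )
    if user_overrides and description in user_overrides:
        result = user_overrides[description]      # 4. user overrides win
    return result
```

Here `model_classify` would wrap the `pipeline(...)` call shown under Usage.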

## Training

Fine-tuned on 16,000 synthetic transaction descriptions generated to match real US bank statement formats. The synthetic data covers six format templates (Chase ACH, Apple Card addresses, PayPal prefixes, etc.) with randomized merchants, store numbers, and addresses.

```
Model:     DistilBERT-base-uncased + LoRA (r=32, alpha=64)
Dataset:   16,000 synthetic samples, 1,000 per category
Trainable: 1.8M / 68.7M parameters (2.6%)
Training:  20 epochs, ~5 minutes on consumer GPU
```
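
This setup can be expressed as a PEFT config. A hedged sketch using PEFT's `LoraConfig`: the dropout value (0.1) and the DistilBERT attention target modules (q_lin, k_lin, v_lin, out_lin) reflect the original training details, though the exact training script may differ.

```python
from peft import LoraConfig, TaskType

# LoRA settings matching the summary above; target modules are DistilBERT's
# attention projections, and the classification head is trained in full.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    target_modules=["q_lin", "k_lin", "v_lin", "out_lin"],
    modules_to_save=["pre_classifier", "classifier"],
)
```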

### Loss Curve

```
Epoch   Train Loss   Val Loss   Train Acc   Val Acc
─────────────────────────────────────────────────
  1       2.670       2.246       17.6%      47.2%
  2       1.685       1.066       60.0%      74.6%
  5       0.355       0.250       90.6%      93.8%
 10       0.086       0.062       97.7%      98.3%
 15       0.036       0.038       99.0%      99.0%
 20       0.028       0.033       99.2%      99.3%
```

### Honest Assessment

The validation accuracy (99.3%) is on synthetic data — the same distribution the model was trained on. Real-world performance is lower because:

- **Bank descriptions are noisier** than synthetic data: abbreviations, custom POS prefixes, and institution-specific formats the generator didn't cover.
- **Some categories are genuinely ambiguous.** WALGREENS is both a pharmacy (Healthcare) and a retail store (Shopping). ABC Stores in Hawaii sells groceries and souvenirs. The "correct" answer depends on what was purchased.
- **The model doesn't know every merchant.** It learned patterns and common names, but niche local businesses and new companies won't match the training data.
- **ACH-format transactions are hard.** `INSTITUTION PURPOSE PPD ID:` looks the same whether it's insurance, utilities, or a subscription. The institution name is the only signal, and the model's vocabulary there is limited.

In our testing against ~1,800 real transactions with bank/rule-derived labels, the model's raw accuracy was ~85%. However, about 6% of the "ground truth" labels were actually wrong (bank miscategorizations), and the model correctly overrode them — adjusted accuracy is ~92%.

## Categories (16)

| Category | Examples |
|----------|----------|
| Restaurants | Fast food, sit-down, coffee shops, food delivery |
| Groceries | Supermarkets, warehouse clubs, farmers markets |
| Shopping | Retail, online purchases, department stores |
| Transportation | Gas, rideshare, auto maintenance, parking |
| Entertainment | Movies, events, gaming |
| Utilities | Electric, internet, phone, water |
| Subscription | Streaming, SaaS, news, software |
| Healthcare | Pharmacy, doctor, dentist, therapy |
| Insurance | Auto, home, health, life insurance |
| Housing | Rent, mortgage, home maintenance |
| Travel | Hotels, airlines, car rental |
| Education | Online courses, books, tuition |
| Personal Care | Salon, gym, beauty, spa |
| Transfer | CC autopay, Zelle/Venmo sends, bank transfers |
| Income | Payroll, direct deposit, interest, refunds |
| Fees | Bank fees, late fees, service charges |

"Business" is intentionally not a category. Whether a transaction is a business expense depends on which account it's charged to, not the description. That's an account-level annotation, not a classification task.

## Usage

### Python

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="DoDataThings/distilbert-us-transaction-classifier")

result = classifier("STARBUCKS #1234")
print(result)  # [{'label': 'Restaurants', 'score': 0.93}]
```
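
Since the model is a fallback layer rather than ground truth, one common pattern is to act on its label only above a confidence cutoff and route everything else to manual review. A minimal sketch; the `accept_prediction` helper and the 0.80 threshold are illustrative, not part of this repo:

```python
# Accept the model's label only when its score clears a cutoff; return None
# otherwise so the caller can queue the transaction for manual review.
# The 0.80 value is an arbitrary example, not a tuned threshold.
CONFIDENCE_CUTOFF = 0.80

def accept_prediction(result, cutoff=CONFIDENCE_CUTOFF):
    """result is one pipeline output dict, e.g. {'label': 'Restaurants', 'score': 0.93}."""
    return result["label"] if result["score"] >= cutoff else None
```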

### JavaScript (Transformers.js)

```javascript
const { pipeline } = require('@xenova/transformers');

const classifier = await pipeline(
  'text-classification',
  'DoDataThings/distilbert-us-transaction-classifier'
);

const result = await classifier('STARBUCKS #1234');
// [{ label: 'Restaurants', score: 0.93 }]
```

An ONNX export is included in the `onnx/` subdirectory.

## Training Data

The synthetic dataset is published at [DoDataThings/us-bank-transaction-categories](https://huggingface.co/datasets/DoDataThings/us-bank-transaction-categories). The generator script is open source — you can extend the merchant pools, add format templates, or increase sample counts to improve accuracy for your use case.
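
The template approach can be sketched in a few lines. Everything below is illustrative: the `MERCHANTS` pool is a tiny made-up sample, and the templates cover only a subset of the documented formats (store numbers, order IDs, POS prefixes, action prefixes, capitalization variants).

```python
import random

# Minimal sketch of the generator's shape: pick a merchant for a category,
# then render it through a randomly chosen statement-format template.
MERCHANTS = {
    "Restaurants": ["STARBUCKS", "CHIPOTLE"],
    "Groceries": ["WHOLE FOODS", "SAFEWAY"],
}

def render(merchant):
    """Apply one format template with randomized details."""
    templates = [
        lambda m: f"{m} #{random.randint(1000, 99999)}",    # merchant + store number
        lambda m: f"{m}*{random.randint(100000, 999999)}",  # merchant + order ID
        lambda m: f"SQ *{m}",                               # POS prefix
        lambda m: f"Withdrawal from {m}",                   # action prefix
        lambda m: m.lower(),                                # capitalization variant
    ]
    return random.choice(templates)(merchant)

def generate(category, n):
    """Return n (description, label) pairs for one category."""
    return [(render(random.choice(MERCHANTS[category])), category) for _ in range(n)]
```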

## Limitations

- **US bank formats only.** Trained on Chase, Apple Card, PayPal, and Capital One statement patterns. International bank descriptions will not classify well.
- **Synthetic training data.** The model learned from generated descriptions, not real transaction data. It may miss patterns that only appear in real bank exports.
- **Shopping is the weakest category** (~92% on synthetic) due to overlap with Subscription (Amazon sells both products and subscriptions) and Groceries (warehouse clubs).
- **Not a standalone solution.** This model is a fallback layer. For production use, pair it with merchant rules and bank category mappings for best results.

## License