DoDataThings committed on
Commit 7d19f46 · verified · 1 Parent(s): 2a8d194

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,216 @@
+ ---
+ language: en
+ license: mit
+ library_name: transformers
+ tags:
+ - text-classification
+ - finance
+ - transactions
+ - distilbert
+ - onnx
+ - transformers.js
+ datasets:
+ - DoDataThings/us-bank-transaction-categories
+ pipeline_tag: text-classification
+ model-index:
+ - name: distilbert-us-transaction-classifier
+   results:
+   - task:
+       type: text-classification
+       name: Transaction Classification
+     metrics:
+     - type: accuracy
+       value: 0.9975
+       name: Validation Accuracy
+ ---
+
+ # DistilBERT US Bank Transaction Classifier
+
+ Fine-tuned DistilBERT model that classifies US bank transaction descriptions into 16 spending categories. Built for real bank statement formats: the messy, abbreviated, ALL-CAPS descriptions that actually appear on Chase, Apple Card, PayPal, and Capital One statements.
+
+ ## Why This Exists
+
+ Off-the-shelf transaction classifiers are trained on clean data like `"Starbucks coffee"`. Real bank statements look like this:
+
+ ```
+ PAYPAL INST XFER GOOGLE YOUTUBE WEB ID: PAYPALSI77
+ AMAZON MKTPL*RJ7GA07V1
+ TST*TAISHOKEN RAMEN - MI
+ WELLS FARGO IFI DDA TO DDA FP0WP73DKR WEB ID: INTFITRVOS
+ AUTOMATIC PAYMENT - THANK
+ ```
+
+ We tested two popular Hugging Face transaction classifiers on real US bank descriptions. They scored **4/25** (16%) and **9/25** (36%). This model scores **36/40** (90%) on its 40-description real-world test.
+
+ ## Categories (16)
+
+ | Category | What it covers |
+ |----------|---------------|
+ | Restaurants | Fast food, sit-down, coffee shops, food delivery |
+ | Groceries | Supermarkets, warehouse clubs, farmers markets |
+ | Shopping | Retail, online purchases, department stores |
+ | Transportation | Gas, rideshare, auto maintenance, parking, transit |
+ | Entertainment | Movies, events, gaming |
+ | Utilities | Electric, internet, phone, water |
+ | Subscription | Streaming, SaaS, news, software subscriptions |
+ | Healthcare | Pharmacy, doctor, dentist, therapy |
+ | Insurance | Auto, home, health, life insurance |
+ | Housing | Rent, mortgage, home maintenance |
+ | Travel | Hotels, airlines, car rental, booking sites |
+ | Education | Online courses, books, tuition |
+ | Personal Care | Salon, gym, beauty, spa |
+ | Transfer | CC autopay, Zelle/Venmo sends, bank transfers, loan payments |
+ | Income | Payroll, direct deposit, interest, refunds |
+ | Fees | Bank fees, late fees, service charges |
+
+ **Note:** "Business" is intentionally not a category. Whether a transaction is a business expense depends on which *account* it's charged to, not the merchant. An Anthropic subscription on a business account is a business expense; on a personal card it's a personal subscription. Both are classified as "Subscription"; the account context is a separate layer.
+
+ ## Performance
+
+ ```
+ Validation Accuracy: 99.75% (2,394/2,400)
+ Real-World Accuracy: 90.0%  (36/40 on unseen bank descriptions)
+
+ Per-Category Validation Accuracy:
+   Education       100.0%
+   Entertainment   100.0%
+   Fees            100.0%
+   Groceries       100.0%
+   Healthcare      100.0%
+   Housing         100.0%
+   Income          100.0%
+   Insurance       100.0%
+   Personal Care   100.0%
+   Restaurants     100.0%
+   Subscription    100.0%
+   Transfer        100.0%
+   Transportation  100.0%
+   Travel          100.0%
+   Utilities       100.0%
+   Shopping         96.1%
+ ```
+
+ ### Loss Curve
+
+ ```
+ Epoch  Train Loss  Val Loss  Train Acc  Val Acc
+ ───────────────────────────────────────────────
+     1      2.6755    2.2898      16.5%    41.5%
+     2      1.6954    1.0686      59.1%    74.5%
+     5      0.3614    0.2245      90.5%    94.4%
+    10      0.0708    0.0468      98.2%    98.5%
+    15      0.0320    0.0160      99.1%    99.6%
+    20      0.0212    0.0144      99.4%    99.8%
+ ```
+
+ ### Real-World Test Results
+
+ Tested on actual transaction descriptions from US bank statements (not seen during training):
+
+ ```
+ ✓ Zelle payment to JOHN SMITH, CITY, CA  → Transfer        100%
+ ✓ AUTOMATIC PAYMENT - THANK              → Transfer        100%
+ ✓ STARBUCKS #12345                       → Restaurants      93%
+ ✓ CHEVRON 0203721                        → Transportation  100%
+ ✓ Netflix                                → Subscription    100%
+ ✓ TARGET 00014720                        → Shopping        100%
+ ✓ FARMERS INS BILLING                    → Insurance       100%
+ ✓ UBER EATS                              → Restaurants     100%
+ ✓ WHOLE FOODS                            → Groceries       100%
+ ✓ AMAZON MKTPL*RJ7GA07V1                 → Shopping        100%
+ ✓ AMAZON WEB SERVICES                    → Subscription     95%
+ ✓ Mortgage payment                       → Housing         100%
+ ✓ WELLS FARGO IFI DDA TO DDA ...         → Transfer        100%
+ ✓ Patelco CU PAYROLL PPD ID: ...         → Income           99%
+ ```
+
+ ## Usage
+
+ ### Python (Transformers)
+
+ ```python
+ from transformers import pipeline
+
+ # Downloads the model from the Hugging Face Hub on first use.
+ classifier = pipeline("text-classification", model="DoDataThings/distilbert-us-transaction-classifier")
+
+ transactions = [
+     "STARBUCKS #1234",
+     "AMAZON MKTPL*AB1CD2EF3",
+     "Zelle payment to JANE DOE, SEATTLE, WA 12345678901",
+     "AUTOMATIC PAYMENT - THANK",
+     "FARMERS INS BILLING",
+ ]
+
+ for text in transactions:
+     # The pipeline returns a list; the first entry is the top prediction.
+     result = classifier(text)[0]
+     print(f"{text:50s} → {result['label']:20s} {result['score']:.0%}")
+ ```
+
+ ### JavaScript (Transformers.js / ONNX)
+
+ ```javascript
+ // ES module import: the top-level `await` below is only valid in an ES module.
+ import { pipeline } from '@xenova/transformers';
+
+ const classifier = await pipeline(
+   'text-classification',
+   'DoDataThings/distilbert-us-transaction-classifier',
+   // This commit ships only onnx/model.onnx (no quantized file),
+   // so request the full-precision weights.
+   { quantized: false }
+ );
+
+ const result = await classifier('STARBUCKS #1234');
+ console.log(result); // [{ label: 'Restaurants', score: 0.93 }]
+ ```
+
+ ### ONNX Runtime (direct)
+
+ The repository includes an ONNX export at `onnx/model.onnx` for use with ONNX Runtime, Transformers.js, or any ONNX-compatible runtime.
+
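Whichever runtime produces the logits, the last step is the same: a softmax over the 16 outputs and a lookup in `id2label`. The following is a minimal sketch of that post-processing step only, assuming the logits arrive as a plain list of floats. The label order is copied from `config.json` in this commit; `softmax`, `top_prediction`, and the sample logits are illustrative and not part of the repository.

```python
import math

# Label order copied from "id2label" in config.json.
ID2LABEL = [
    "Education", "Entertainment", "Fees", "Groceries",
    "Healthcare", "Housing", "Income", "Insurance",
    "Personal Care", "Restaurants", "Shopping", "Subscription",
    "Transfer", "Transportation", "Travel", "Utilities",
]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits):
    """Map raw model logits to the most likely label and its probability."""
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# Made-up logits for illustration: index 9 ("Restaurants") is the largest.
example_logits = [0.0] * 16
example_logits[9] = 5.0
print(top_prediction(example_logits))
```

The returned dictionary mirrors the `label`/`score` shape shown in the Python and JavaScript examples above.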
+ ## Training Details
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Base model | `distilbert-base-uncased` |
+ | Method | LoRA (r=32, alpha=64, dropout=0.1) |
+ | Target modules | q_lin, k_lin, v_lin, out_lin + classifier head |
+ | Trainable params | 1,782,544 / 68,748,320 (2.6%) |
+ | Dataset | 16,000 synthetic transactions (1,000 per category) |
+ | Epochs | 20 |
+ | Batch size | 32 |
+ | Learning rate | 3e-5 (linear warmup 10%) |
+ | Training time | ~5 minutes on an NVIDIA RTX GPU |
+
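The trainable-parameter figure in the table can be reproduced from the values in `config.json`. The sketch below is a back-of-the-envelope check, not the training script. It rests on two assumptions that happen to match the reported numbers exactly: the "classifier head" means both `pre_classifier` and `classifier`, and the adapter wrapper keeps a frozen copy of that head alongside the trainable one. The helper names are illustrative.

```python
# Values from config.json and the Training Details table.
DIM = 768              # "dim"
HIDDEN_DIM = 3072      # "hidden_dim"
N_LAYERS = 6           # "n_layers"
VOCAB_SIZE = 30522     # "vocab_size"
MAX_POSITIONS = 512    # "max_position_embeddings"
NUM_LABELS = 16        # entries in "id2label"
LORA_R = 32            # LoRA rank
TARGETS_PER_LAYER = 4  # q_lin, k_lin, v_lin, out_lin


def linear_params(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


def lora_params(n_in, n_out, r):
    """LoRA adds A (r x n_in) and B (n_out x r); neither has a bias."""
    return r * n_in + n_out * r


lora_total = N_LAYERS * TARGETS_PER_LAYER * lora_params(DIM, DIM, LORA_R)

# Assumption: head = pre_classifier (dim -> dim) + classifier (dim -> labels).
head_total = linear_params(DIM, DIM) + linear_params(DIM, NUM_LABELS)
trainable = lora_total + head_total

# Frozen DistilBERT encoder: embeddings plus six transformer layers.
layer_norm = 2 * DIM
embeddings = VOCAB_SIZE * DIM + MAX_POSITIONS * DIM + layer_norm
per_layer = (
    4 * linear_params(DIM, DIM)        # q, k, v, out projections
    + linear_params(DIM, HIDDEN_DIM)   # feed-forward up-projection
    + linear_params(HIDDEN_DIM, DIM)   # feed-forward down-projection
    + 2 * layer_norm
)
encoder = embeddings + N_LAYERS * per_layer

# Assumption: the original head is kept frozen next to the trainable copy.
total = encoder + 2 * head_total + lora_total

print(f"trainable: {trainable:,}")      # 1,782,544
print(f"total:     {total:,}")          # 68,748,320
print(f"ratio:     {trainable / total:.1%}")  # 2.6%
```

About two thirds of the trainable parameters (1,179,648) are LoRA matrices; the rest (602,896) is the classification head.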
+ ### Training Data
+
+ The model was trained on synthetic transaction descriptions generated to match real US bank statement formats. Six distinct format templates cover the major US banks:
+
+ 1. **ACH format**: fixed-width columns with `WEB ID:` or `PPD ID:` suffixes
+ 2. **Merchant + store number**: `MERCHANT #1234` or `MERCHANT*ORDERID`
+ 3. **Full address**: `MERCHANT ADDRESS CITY ZIP STATE COUNTRY`
+ 4. **PayPal prefix**: `PreApproved Payment Bill User Payment: MERCHANT`
+ 5. **Action prefix**: `Withdrawal from DESCRIPTION` / `Deposit from DESCRIPTION`
+ 6. **Simple**: `MERCHANT` or `MERCHANT.COM`
+
+ Variations include randomized capitalization, spacing, store numbers, order IDs, city/state, and POS prefixes (`SQ *`, `TST*`).
+
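To make the six templates concrete, here is a small generator sketch. It is not the script used to build the dataset: the function names, the placeholder address, and the ID lengths are all assumptions chosen for illustration, and fixed-width ACH column padding and random capitalization are left out for brevity.

```python
import random
import string

TEMPLATES = ("ach", "store", "address", "paypal", "action", "simple")
POS_PREFIXES = ("", "SQ *", "TST*")


def random_id(rng, length=10):
    """Random upper-case alphanumeric ID, used for order and ACH IDs."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def render(merchant, template, rng):
    """Render one description in the given format template."""
    if template == "ach":
        return f"{merchant} {rng.choice(['WEB ID:', 'PPD ID:'])} {random_id(rng)}"
    if template == "store":
        if rng.random() < 0.5:
            return f"{merchant} #{rng.randint(1000, 9999)}"
        return f"{merchant}*{random_id(rng)}"
    if template == "address":
        # Placeholder address; a fuller generator would sample these fields.
        return f"{merchant} 123 MAIN ST SEATTLE 98101 WA USA"
    if template == "paypal":
        return f"PreApproved Payment Bill User Payment: {merchant}"
    if template == "action":
        return f"{rng.choice(['Withdrawal', 'Deposit'])} from {merchant}"
    if template == "simple":
        return rng.choice([merchant, f"{merchant}.COM"])
    raise ValueError(f"unknown template: {template}")


def generate(merchant, label, n, seed=0):
    """Return n (description, label) pairs with randomly chosen templates."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        text = render(merchant, template, rng)
        # POS prefixes only make sense in front of a bare merchant name.
        if template in ("store", "simple"):
            text = rng.choice(POS_PREFIXES) + text
        rows.append((text, label))
    return rows


for description, label in generate("STARBUCKS", "Restaurants", n=5):
    print(f"{label:12s} {description}")
```

Seeding the generator keeps the output reproducible, which makes it easier to compare training runs on the same synthetic data.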
+ The synthetic dataset is published separately at [DoDataThings/us-bank-transaction-categories](https://huggingface.co/datasets/DoDataThings/us-bank-transaction-categories).
+
+ ## Recommended Use
+
+ This model works best as **one layer in a classification pipeline**:
+
+ 1. **Merchant rules** (pattern matching): catch known merchants and structural patterns
+ 2. **Bank-provided categories**: map the bank's own classifications to your categories
+ 3. **This model**: classifies everything else
+ 4. **User overrides**: permanent manual corrections
+
+ The model handles the long tail that rules and bank categories miss. For the highest accuracy, combine all four layers.
+
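One way to wire the four layers together is sketched below. The rule patterns, the bank-category map, the `min_confidence` threshold, and the stub model are all hypothetical; in practice `model` would wrap the Transformers pipeline from the Usage section. The sketch also assumes that user overrides take precedence over every other layer, which the list above implies but does not state.

```python
import re
from typing import Callable, Dict, Optional, Tuple

# Hypothetical merchant rules: (compiled pattern, category).
MERCHANT_RULES = [
    (re.compile(r"^AUTOMATIC PAYMENT"), "Transfer"),
    (re.compile(r"\bNETFLIX\b", re.IGNORECASE), "Subscription"),
]

# Hypothetical mapping from a bank's own labels to this model's categories.
BANK_CATEGORY_MAP = {"Food & Drink": "Restaurants", "Gas": "Transportation"}


def categorize(
    description: str,
    model: Callable[[str], Tuple[str, float]],
    bank_category: Optional[str] = None,
    overrides: Optional[Dict[str, str]] = None,
    min_confidence: float = 0.5,
) -> Tuple[Optional[str], str]:
    """Return (category, source) using overrides, rules, bank data, then the model."""
    if overrides and description in overrides:
        return overrides[description], "override"
    for pattern, category in MERCHANT_RULES:
        if pattern.search(description):
            return category, "rule"
    if bank_category in BANK_CATEGORY_MAP:
        return BANK_CATEGORY_MAP[bank_category], "bank"
    label, score = model(description)
    if score < min_confidence:
        # Low-confidence predictions are left for a rule or a manual correction.
        return None, "needs_review"
    return label, "model"


def stub_model(description: str) -> Tuple[str, float]:
    """Stand-in for the real classifier so the sketch runs offline."""
    if "STARBUCKS" in description:
        return "Restaurants", 0.93
    return "Shopping", 0.30


print(categorize("AUTOMATIC PAYMENT - THANK", stub_model))
print(categorize("STARBUCKS #1234", stub_model))
print(categorize("Payment", stub_model))
```

Returning the source alongside the category makes it easy to audit which layer decided each transaction.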
+ ## Limitations
+
+ - Trained on US bank statement formats only; may not work well with international bank descriptions
+ - Shopping is the weakest category (96.1%) due to overlap with Groceries and Subscription
+ - Single-word descriptions like "Payment" are ambiguous and get low confidence; handle them with rules
+ - The model classifies by transaction description only; it cannot determine account-level context (personal vs business)
+
+ ## License
+
+ MIT
config.json ADDED
@@ -0,0 +1,61 @@
+ {
+   "_name_or_path": "distilbert-base-uncased",
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertForSequenceClassification"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "hidden_dim": 3072,
+   "id2label": {
+     "0": "Education",
+     "1": "Entertainment",
+     "2": "Fees",
+     "3": "Groceries",
+     "4": "Healthcare",
+     "5": "Housing",
+     "6": "Income",
+     "7": "Insurance",
+     "8": "Personal Care",
+     "9": "Restaurants",
+     "10": "Shopping",
+     "11": "Subscription",
+     "12": "Transfer",
+     "13": "Transportation",
+     "14": "Travel",
+     "15": "Utilities"
+   },
+   "initializer_range": 0.02,
+   "label2id": {
+     "Education": 0,
+     "Entertainment": 1,
+     "Fees": 2,
+     "Groceries": 3,
+     "Healthcare": 4,
+     "Housing": 5,
+     "Income": 6,
+     "Insurance": 7,
+     "Personal Care": 8,
+     "Restaurants": 9,
+     "Shopping": 10,
+     "Subscription": 11,
+     "Transfer": 12,
+     "Transportation": 13,
+     "Travel": 14,
+     "Utilities": 15
+   },
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "pad_token_id": 0,
+   "problem_type": "single_label_classification",
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "torch_dtype": "float32",
+   "transformers_version": "4.49.0",
+   "vocab_size": 30522
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c09587c7ed91cd59dfeb05fce16127de5d3f78eaade35f6ec68c513d2a4f15f8
+ size 267875632
onnx/model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:faeeb15de6d752d2ee75a9ff63209d94768a4c831140976059f5e6803c5f4e23
+ size 267975237
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "DistilBertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff