Commit e9bf504 (verified) by DoDataThings · Parent: 84f4d09

Upload README.md with huggingface_hub

---
language: en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- finance
- transactions
- distilbert
- onnx
- transformers.js
datasets:
- DoDataThings/us-bank-transaction-categories
pipeline_tag: text-classification
---

## Update Notice

> **A newer version of this model is available:** [DoDataThings/distilbert-us-transaction-classifier-v2](https://huggingface.co/DoDataThings/distilbert-us-transaction-classifier-v2)
>
> v2 adds sign-aware classification (`[debit]`/`[credit]` prefix), expanded merchant coverage (500+ merchants), PayPal wrapper handling, and a refined 16-category taxonomy. See the v2 model card for details.
>
> This v1 model remains available for backward compatibility. If you don't need sign-aware classification, v1 still works well for basic transaction categorization.

# DistilBERT US Bank Transaction Classifier

A fine-tuned DistilBERT model for classifying US bank transaction descriptions into 16 spending categories. Designed as a **fallback layer** in a multi-tier classification pipeline — not a standalone classifier.

## What This Is (and Isn't)

This model was built to handle the long tail of transaction descriptions that merchant rules and bank-provided categories don't cover. It works best as one component in a system:

1. **Merchant rules** — pattern matching catches known merchants (highest accuracy)
2. **Bank-provided categories** — map the bank's own classifications (when available)
3. **This model** — classifies everything else (the fallback)
4. **User overrides** — manual corrections for edge cases
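
The tiers above can be sketched as a simple cascade. This is an illustrative sketch, not shipped code; `MERCHANT_RULES` and `model_predict` are hypothetical stand-ins for your rule table and this model's pipeline call:

```python
import re

# Tier 1: merchant rules. Hypothetical example patterns; a real table
# would be much larger.
MERCHANT_RULES = [
    (re.compile(r"\bSTARBUCKS\b"), "Restaurants"),
    (re.compile(r"\bSHELL OIL\b"), "Transportation"),
]

def classify(description, bank_category=None, user_override=None, model_predict=None):
    """Cascade: user override > merchant rule > bank category > model fallback."""
    if user_override:                    # Tier 4: manual corrections win outright
        return user_override
    for pattern, category in MERCHANT_RULES:  # Tier 1: known merchants
        if pattern.search(description.upper()):
            return category
    if bank_category:                    # Tier 2: map the bank's own label
        return bank_category
    if model_predict:                    # Tier 3: this model as the fallback
        return model_predict(description)[0]["label"]
    return "Uncategorized"               # nothing matched, route to review
```

For example, `classify("STARBUCKS #1234")` resolves at tier 1 without ever touching the model.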

On its own, this model will make mistakes. Bank transaction descriptions are messy, abbreviated, and sometimes genuinely ambiguous — the same store can be "Groceries" or "Shopping" depending on what was purchased, and no classifier can know that from the description alone.

## Training

Fine-tuned on 16,000 synthetic transaction descriptions generated to match real US bank statement formats. The synthetic data covers six format templates (Chase ACH, Apple Card addresses, PayPal prefixes, etc.) with randomized merchants, store numbers, and addresses.

```
Model:     DistilBERT-base-uncased + LoRA (r=32, alpha=64)
Dataset:   16,000 synthetic samples, 1,000 per category
Trainable: 1.8M / 68.7M parameters (2.6%)
Training:  20 epochs, ~5 minutes on consumer GPU
```
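
The hyperparameters above can be expressed as a `peft` `LoraConfig`. This is a sketch of a comparable setup, not the actual training script; the dropout value and target modules are assumptions:

```python
from peft import LoraConfig, TaskType

# LoRA settings matching the table above. target_modules names DistilBERT's
# attention projections; dropout is an assumed value, not stated in the card.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,         # sequence classification head
    r=32,                               # LoRA rank
    lora_alpha=64,                      # scaling factor
    lora_dropout=0.1,                   # assumption
    target_modules=["q_lin", "v_lin"],  # assumption
)
```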

### Loss Curve

```
Epoch   Train Loss   Val Loss   Train Acc   Val Acc
───────────────────────────────────────────────────
  1       2.670        2.246      17.6%      47.2%
  2       1.685        1.066      60.0%      74.6%
  5       0.355        0.250      90.6%      93.8%
 10       0.086        0.062      97.7%      98.3%
 15       0.036        0.038      99.0%      99.0%
 20       0.028        0.033      99.2%      99.3%
```

### Honest Assessment

The validation accuracy (99.3%) is on synthetic data — the same distribution the model was trained on. Real-world performance is lower because:

- **Bank descriptions are noisier** than synthetic data. Real statements contain abbreviations, custom POS prefixes, and institution-specific formats the generator didn't cover.
- **Some categories are genuinely ambiguous.** WALGREENS is both a pharmacy (Healthcare) and a retail store (Shopping). ABC Stores in Hawaii sells groceries and souvenirs. The "correct" answer depends on what was purchased.
- **The model doesn't know every merchant.** It learned patterns and common names, but niche local businesses or new companies won't match training data.
- **ACH-format transactions are hard.** `INSTITUTION PURPOSE PPD ID:` looks the same whether it's insurance, utilities, or a subscription. The institution name is the only signal, and the model's vocabulary here is limited.
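
One mitigation for the ACH case is stripping the boilerplate before classification, so the institution and purpose tokens carry more weight. A sketch with an illustrative, deliberately non-exhaustive pattern:

```python
import re

# Remove common ACH noise (PPD/CCD/WEB entry-class codes and trailing IDs).
# The pattern is illustrative, not a complete inventory of ACH formats.
ACH_NOISE = re.compile(r"\b(PPD|CCD|WEB)\b\s*ID:\s*\S*", re.IGNORECASE)

def normalize_ach(description: str) -> str:
    cleaned = ACH_NOISE.sub("", description)
    return re.sub(r"\s+", " ", cleaned).strip()
```

For example, `normalize_ach("GEICO PREMIUM PPD ID:1234567")` leaves only `"GEICO PREMIUM"` for the classifier.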

In our testing against ~1,800 real transactions with bank/rule-derived labels, the model's raw accuracy was ~85%. However, about 6% of the "ground truth" labels were actually wrong (bank miscategorizations), and the model correctly overrode them — adjusted accuracy is ~92%.
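
The adjustment is simple arithmetic: disagreements where the label, not the model, was wrong get reclassified as correct. Back-of-envelope with illustrative counts chosen to match the reported figures:

```python
# Illustrative reconstruction of the adjusted-accuracy figure; the card
# reports ~85% raw and ~92% adjusted on ~1,800 transactions.
total = 1800
raw_correct = 1533          # model agrees with the bank/rule label (~85%)
overridden_correct = 114    # label was wrong, model's answer was right (~6%)

raw_accuracy = raw_correct / total
adjusted_accuracy = (raw_correct + overridden_correct) / total

print(f"raw={raw_accuracy:.1%} adjusted={adjusted_accuracy:.1%}")
```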

## Categories (16)

| Category | What it covers |
|----------|---------------|
| Restaurants | Fast food, sit-down, coffee shops, food delivery |
| Groceries | Supermarkets, warehouse clubs, farmers markets |
| Shopping | Retail, online purchases, department stores |
| Transportation | Gas, rideshare, auto maintenance, parking |
| Entertainment | Movies, events, gaming |
| Utilities | Electric, internet, phone, water |
| Subscription | Streaming, SaaS, news, software |
| Healthcare | Pharmacy, doctor, dentist, therapy |
| Insurance | Auto, home, health, life insurance |
| Housing | Rent, mortgage, home maintenance |
| Travel | Hotels, airlines, car rental |
| Education | Online courses, books, tuition |
| Personal Care | Salon, gym, beauty, spa |
| Transfer | CC autopay, Zelle/Venmo sends, bank transfers |
| Income | Payroll, direct deposit, interest, refunds |
| Fees | Bank fees, late fees, service charges |
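
For post-processing, the label set in the table can be pinned down as a constant. This list is a convenience mirroring the table, not an artifact shipped with the model; the authoritative mapping is the model's own `id2label` config:

```python
# The 16 categories, transcribed from the table above.
CATEGORIES = [
    "Restaurants", "Groceries", "Shopping", "Transportation",
    "Entertainment", "Utilities", "Subscription", "Healthcare",
    "Insurance", "Housing", "Travel", "Education",
    "Personal Care", "Transfer", "Income", "Fees",
]
```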

"Business" is intentionally not a category. Whether a transaction is a business expense depends on which account it's charged to, not the description. That's an account-level annotation, not a classification task.

## Usage

### Python

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="DoDataThings/distilbert-us-transaction-classifier")

result = classifier("STARBUCKS #1234")
print(result)  # [{'label': 'Restaurants', 'score': 0.93}]
```
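
Because this is a fallback layer, gating on the score keeps low-confidence guesses out of the ledger and routes them to manual review instead. A sketch over pipeline-style output; the 0.7 threshold is an assumption to tune on your own data:

```python
def accept_prediction(predictions, threshold=0.7):
    """Return the top label if its score clears the threshold, else None
    (meaning: route to manual review). `predictions` is pipeline-style
    output: a list of {'label': str, 'score': float} dicts."""
    top = max(predictions, key=lambda p: p["score"])
    return top["label"] if top["score"] >= threshold else None
```

For example, `accept_prediction(classifier("STARBUCKS #1234"))` accepts a confident hit, while a near-tie between Shopping and Groceries comes back as `None`.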

### JavaScript (Transformers.js)

```javascript
// @xenova/transformers is an ES module; use `import` (top-level `await`
// is only available in ES modules, where `require` does not exist).
import { pipeline } from '@xenova/transformers';

const classifier = await pipeline(
  'text-classification',
  'DoDataThings/distilbert-us-transaction-classifier'
);

const result = await classifier('STARBUCKS #1234');
// [{ label: 'Restaurants', score: 0.93 }]
```

An ONNX export is included in the `onnx/` subdirectory.

## Training Data

The synthetic dataset is published at [DoDataThings/us-bank-transaction-categories](https://huggingface.co/datasets/DoDataThings/us-bank-transaction-categories). The generator script is open source — you can extend the merchant pools, add format templates, or increase sample counts to improve accuracy for your use case.

## Limitations

- **US bank formats only.** Trained on Chase, Apple Card, PayPal, and Capital One statement patterns. International bank descriptions will not classify well.
- **Synthetic training data.** The model learned from generated descriptions, not real transaction data. It may miss patterns that only appear in real bank exports.
- **Shopping is the weakest category** (~92% on synthetic) due to overlap with Subscription (Amazon sells both products and subscriptions) and Groceries (warehouse clubs).
- **Not a standalone solution.** This model is a fallback layer. For production use, pair it with merchant rules and bank category mappings.

## License

Apache 2.0