Upload folder using huggingface_hub

Files changed (8)

README.md +216 -0
config.json +61 -0
model.safetensors +3 -0
onnx/model.onnx +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +56 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,216 @@

+---
+language: en
+license: mit
+library_name: transformers
+tags:
+  - text-classification
+  - finance
+  - transactions
+  - distilbert
+  - onnx
+  - transformers.js
+datasets:
+  - DoDataThings/us-bank-transaction-categories
+pipeline_tag: text-classification
+model-index:
+  - name: distilbert-us-transaction-classifier
+    results:
+      - task:
+          type: text-classification
+          name: Transaction Classification
+        metrics:
+          - type: accuracy
+            value: 0.9975
+            name: Validation Accuracy
+---
+# DistilBERT US Bank Transaction Classifier
+Fine-tuned DistilBERT model that classifies US bank transaction descriptions into 16 spending categories. Built for real bank statement formats — the messy, abbreviated, ALL-CAPS descriptions you actually see on Chase, Apple Card, PayPal, and Capital One statements.
+## Why This Exists
+Off-the-shelf transaction classifiers are trained on clean data like `"Starbucks coffee"`. Real bank statements look like this:
+```
+PAYPAL           INST XFER  GOOGLE YOUTUBE  WEB ID: PAYPALSI77
+AMAZON MKTPL*RJ7GA07V1
+TST*TAISHOKEN RAMEN - MI
+WELLS FARGO IFI  DDA TO DDA FP0WP73DKR      WEB ID: INTFITRVOS
+AUTOMATIC PAYMENT - THANK
+```
+We tested two popular Hugging Face transaction classifiers on 25 real US bank descriptions. They scored **4/25** (16%) and **9/25** (36%). On a set of 40 unseen real bank descriptions, this model scores **36/40** (90%).
+## Categories (16)
+| Category | What it covers |
+|----------|---------------|
+| Restaurants | Fast food, sit-down, coffee shops, food delivery |
+| Groceries | Supermarkets, warehouse clubs, farmers markets |
+| Shopping | Retail, online purchases, department stores |
+| Transportation | Gas, rideshare, auto maintenance, parking, transit |
+| Entertainment | Movies, events, gaming |
+| Utilities | Electric, internet, phone, water |
+| Subscription | Streaming, SaaS, news, software subscriptions |
+| Healthcare | Pharmacy, doctor, dentist, therapy |
+| Insurance | Auto, home, health, life insurance |
+| Housing | Rent, mortgage, home maintenance |
+| Travel | Hotels, airlines, car rental, booking sites |
+| Education | Online courses, books, tuition |
+| Personal Care | Salon, gym, beauty, spa |
+| Transfer | CC autopay, Zelle/Venmo sends, bank transfers, loan payments |
+| Income | Payroll, direct deposit, interest, refunds |
+| Fees | Bank fees, late fees, service charges |
+**Note:** "Business" is intentionally not a category. Whether a transaction is a business expense depends on which *account* it's charged to, not the merchant. An Anthropic subscription on a business account is a business expense; on a personal card it's a personal subscription. Both are classified as "Subscription" — the account context is a separate layer.
+## Performance
+```
+Validation Accuracy:  99.75%  (2,394/2,400)
+Real-World Accuracy:  90.0%   (36/40 on unseen bank descriptions)
+Per-Category Validation Accuracy:
+  Education           100.0%
+  Entertainment       100.0%
+  Fees                100.0%
+  Groceries           100.0%
+  Healthcare          100.0%
+  Housing             100.0%
+  Income              100.0%
+  Insurance           100.0%
+  Personal Care       100.0%
+  Restaurants         100.0%
+  Subscription        100.0%
+  Transfer            100.0%
+  Transportation      100.0%
+  Travel              100.0%
+  Utilities           100.0%
+  Shopping             96.1%
+```
+### Loss Curve
+```
+Epoch  Train Loss   Val Loss  Train Acc   Val Acc
+─────────────────────────────────────────────────
+    1      2.6755     2.2898     16.5%     41.5%
+    2      1.6954     1.0686     59.1%     74.5%
+    5      0.3614     0.2245     90.5%     94.4%
+   10      0.0708     0.0468     98.2%     98.5%
+   15      0.0320     0.0160     99.1%     99.6%
+   20      0.0212     0.0144     99.4%     99.8%
+```
+### Real-World Test Results
+Tested on actual transaction descriptions from US bank statements (not seen during training):
+```
+✓ Zelle payment to JOHN SMITH, CITY, CA    → Transfer        100%
+✓ AUTOMATIC PAYMENT - THANK                → Transfer        100%
+✓ STARBUCKS #12345                         → Restaurants      93%
+✓ CHEVRON 0203721                          → Transportation  100%
+✓ Netflix                                  → Subscription    100%
+✓ TARGET 00014720                          → Shopping        100%
+✓ FARMERS INS BILLING                      → Insurance       100%
+✓ UBER EATS                                → Restaurants     100%
+✓ WHOLE FOODS                              → Groceries       100%
+✓ AMAZON MKTPL*RJ7GA07V1                   → Shopping        100%
+✓ AMAZON WEB SERVICES                      → Subscription     95%
+✓ Mortgage payment                         → Housing         100%
+✓ WELLS FARGO IFI DDA TO DDA ...           → Transfer        100%
+✓ Patelco CU PAYROLL PPD ID: ...           → Income           99%
+```
+## Usage
+### Python (Transformers)
+```python
+from transformers import pipeline
+classifier = pipeline("text-classification", model="DoDataThings/distilbert-us-transaction-classifier")
+transactions = [
+    "STARBUCKS #1234",
+    "AMAZON MKTPL*AB1CD2EF3",
+    "Zelle payment to JANE DOE, SEATTLE, WA 12345678901",
+    "AUTOMATIC PAYMENT - THANK",
+    "FARMERS INS BILLING",
+]
+for text in transactions:
+    result = classifier(text)[0]
+    print(f"{text:50s} → {result['label']:20s} {result['score']:.0%}")
+```
+### JavaScript (Transformers.js / ONNX)
+```javascript
+// ES module import: the top-level await below needs a module context (e.g. a .mjs file)
+import { pipeline } from '@xenova/transformers';
+const classifier = await pipeline(
+  'text-classification',
+  'DoDataThings/distilbert-us-transaction-classifier'
+);
+const result = await classifier('STARBUCKS #1234');
+console.log(result); // [{ label: 'Restaurants', score: 0.93 }]
+```
+### ONNX Runtime (direct)
+The repository includes an ONNX export at `onnx/model.onnx` for use with ONNX Runtime, Transformers.js, or any other ONNX-compatible runtime.
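When the ONNX export is called directly, the runtime returns raw logits rather than labels. A minimal sketch of the post-processing step follows, assuming the model emits one row of 16 logits per input (the card does not state the output shape). The label order is copied from `id2label` in `config.json`; `softmax` and `decode` are hypothetical helper names, and the example logits are made up for illustration.

```python
import math

# Label order copied from the id2label mapping in config.json.
ID2LABEL = [
    "Education", "Entertainment", "Fees", "Groceries", "Healthcare",
    "Housing", "Income", "Insurance", "Personal Care", "Restaurants",
    "Shopping", "Subscription", "Transfer", "Transportation", "Travel",
    "Utilities",
]


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label, score) for the highest-probability class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits for illustration only: index 9 (Restaurants) is the largest.
example_logits = [0.0] * 16
example_logits[9] = 5.0
label, score = decode(example_logits)
print(label, f"{score:.0%}")
```

In a real run, the logits would come from the runtime's output for a tokenized description instead of the hand-written list.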
+## Training Details
+| Parameter | Value |
+|-----------|-------|
+| Base model | `distilbert-base-uncased` |
+| Method | LoRA (r=32, alpha=64, dropout=0.1) |
+| Target modules | q_lin, k_lin, v_lin, out_lin + classifier head |
+| Trainable params | 1,782,544 / 68,748,320 (2.6%) |
+| Dataset | 16,000 synthetic transactions (1,000 per category) |
+| Epochs | 20 |
+| Batch size | 32 |
+| Learning rate | 3e-5 (linear warmup 10%) |
+| Training time | ~5 minutes on NVIDIA RTX GPU |
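The trainable-parameter figure in the table can be checked by hand from the LoRA settings and `config.json`. The worked check below assumes each LoRA adapter adds two bias-free matrices of shape `r × dim` and `dim × r`, and that "classifier head" means the `pre_classifier` and `classifier` linear layers of `DistilBertForSequenceClassification`. The card does not spell out either point, so treat this as one plausible accounting.

```python
# Values taken from config.json and the training table.
DIM = 768               # hidden size ("dim")
N_LAYERS = 6            # transformer blocks ("n_layers")
N_LABELS = 16           # output categories
LORA_R = 32             # LoRA rank
TARGETS_PER_LAYER = 4   # q_lin, k_lin, v_lin, out_lin

# Assumption: each adapted 768x768 layer gains A (r x dim) and B (dim x r).
lora_per_module = LORA_R * DIM + DIM * LORA_R
lora_total = lora_per_module * TARGETS_PER_LAYER * N_LAYERS

# Assumption: the head is pre_classifier (dim -> dim) plus
# classifier (dim -> labels), each with a bias vector.
pre_classifier = DIM * DIM + DIM
classifier = DIM * N_LABELS + N_LABELS

trainable = lora_total + pre_classifier + classifier
reported_total = 68_748_320  # total parameter count as reported in the table

print(f"LoRA adapters: {lora_total:,}")
print(f"Head:          {pre_classifier + classifier:,}")
print(f"Trainable:     {trainable:,} ({trainable / reported_total:.1%})")
```

Under these assumptions the sum comes to 1,782,544, which matches the reported trainable count.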
+### Training Data
+The model was trained on synthetic transaction descriptions generated to match real US bank statement formats. Six distinct format templates cover the major US banks:
+1. **ACH format** — fixed-width columns with `WEB ID:` or `PPD ID:` suffixes
+2. **Merchant + store number** — `MERCHANT #1234` or `MERCHANT*ORDERID`
+3. **Full address** — `MERCHANT ADDRESS CITY ZIP STATE COUNTRY`
+4. **PayPal prefix** — `PreApproved Payment Bill User Payment: MERCHANT`
+5. **Action prefix** — `Withdrawal from DESCRIPTION` / `Deposit from DESCRIPTION`
+6. **Simple** — `MERCHANT` or `MERCHANT.COM`
+Variations include randomized capitalization, spacing, store numbers, order IDs, city/state, and POS prefixes (`SQ *`, `TST*`).
+The synthetic dataset is published separately at [DoDataThings/us-bank-transaction-categories](https://huggingface.co/datasets/DoDataThings/us-bank-transaction-categories).
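The template approach described above can be sketched as a small generator. This is not the author's generator, which is not shown in this commit. It covers only four of the six formats (the ACH and full-address formats are omitted), and the merchant name, function names, and probabilities are all hypothetical.

```python
import random

ALPHANUM = "ABCDEFGHJKLMNPQRSTUVWXYZ0123456789"


def merchant_store(merchant, rng):
    """Format 2: MERCHANT #1234 or MERCHANT*ORDERID, with an optional POS prefix."""
    prefix = rng.choice(["", "", "SQ *", "TST*"])
    if rng.random() < 0.5:
        return f"{prefix}{merchant} #{rng.randint(1000, 9999)}"
    order_id = "".join(rng.choices(ALPHANUM, k=9))
    return f"{prefix}{merchant}*{order_id}"


def paypal_prefix(merchant, rng):
    """Format 4: PayPal prefix followed by the merchant."""
    return f"PreApproved Payment Bill User Payment: {merchant}"


def action_prefix(merchant, rng):
    """Format 5: Withdrawal from / Deposit from."""
    action = rng.choice(["Withdrawal from", "Deposit from"])
    return f"{action} {merchant}"


def simple(merchant, rng):
    """Format 6: MERCHANT or MERCHANT.COM."""
    if rng.random() < 0.5:
        return merchant
    return merchant.replace(" ", "") + ".COM"


TEMPLATES = [merchant_store, paypal_prefix, action_prefix, simple]


def vary(text, rng):
    """Randomize capitalization: upper case, title case, or unchanged."""
    return rng.choice([text.upper(), text.title(), text])


def generate(merchant, label, n, seed=0):
    """Yield n (description, label) pairs for one hypothetical merchant."""
    rng = random.Random(seed)
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        yield vary(template(merchant, rng), rng), label


samples = list(generate("ACME CAFE", "Restaurants", 5))
for description, label in samples:
    print(f"{description:60s} {label}")
```

Seeding the generator keeps the output reproducible, which helps when regenerating a dataset to compare training runs.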
+## Recommended Use
+This model works best as **one layer in a classification pipeline**:
+1. **Merchant rules** (pattern matching) — catches known merchants and structural patterns
+2. **Bank-provided categories** — map bank's own classifications to your categories
+3. **This model** — classifies everything else
+4. **User overrides** — permanent manual corrections
+The model handles the long tail that rules and bank categories miss. For the highest accuracy, combine all four layers.
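The four layers can be sketched as a single function that returns the first answer it finds. The rule patterns, the bank category names, and the `classify` and `fake_model` names are hypothetical. Checking user overrides first is an assumption based on their being described as permanent manual corrections; the card does not state a precedence order.

```python
import re

# Hypothetical rule and mapping tables; real ones would be much larger.
MERCHANT_RULES = [
    (re.compile(r"^AUTOMATIC PAYMENT", re.I), "Transfer"),
    (re.compile(r"\bNETFLIX\b", re.I), "Subscription"),
]
BANK_CATEGORY_MAP = {"Food & Drink": "Restaurants", "Gas": "Transportation"}


def classify(description, bank_category=None, overrides=None, model=None):
    """Return (category, source) from the first layer that gives an answer."""
    # Assumption: user overrides win over every other layer.
    if overrides and description in overrides:
        return overrides[description], "override"
    # Layer 1: merchant rules (pattern matching).
    for pattern, category in MERCHANT_RULES:
        if pattern.search(description):
            return category, "rule"
    # Layer 2: map the bank's own category, when one is provided.
    if bank_category in BANK_CATEGORY_MAP:
        return BANK_CATEGORY_MAP[bank_category], "bank"
    # Layer 3: fall back to the model for everything else.
    if model is not None:
        return model(description), "model"
    return None, "unclassified"


def fake_model(description):
    """Stand-in for the classifier so the sketch runs without downloads."""
    return "Shopping"


print(classify("AUTOMATIC PAYMENT - THANK", model=fake_model))
print(classify("CHEVRON 0203721", bank_category="Gas", model=fake_model))
print(classify("TARGET 00014720", model=fake_model))
print(classify("TARGET 00014720",
               overrides={"TARGET 00014720": "Groceries"},
               model=fake_model))
```

With the real model, `model` could be a small wrapper that returns `classifier(text)[0]["label"]` from the Python usage example.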
+## Limitations
+- Trained on US bank statement formats only — may not work well with international bank descriptions
+- Shopping is the weakest category (96.1%) due to overlap with Groceries and Subscription
+- Single-word descriptions like "Payment" are ambiguous: the model gives low-confidence predictions for them, so they should be handled by rules instead
+- The model classifies by transaction description only — it cannot determine account-level context (personal vs business)
+## License
+MIT

config.json ADDED

	@@ -0,0 +1,61 @@

+{
+  "_name_or_path": "distilbert-base-uncased",
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertForSequenceClassification"
+  ],
+  "attention_dropout": 0.1,
+  "dim": 768,
+  "dropout": 0.1,
+  "hidden_dim": 3072,
+  "id2label": {
+    "0": "Education",
+    "1": "Entertainment",
+    "2": "Fees",
+    "3": "Groceries",
+    "4": "Healthcare",
+    "5": "Housing",
+    "6": "Income",
+    "7": "Insurance",
+    "8": "Personal Care",
+    "9": "Restaurants",
+    "10": "Shopping",
+    "11": "Subscription",
+    "12": "Transfer",
+    "13": "Transportation",
+    "14": "Travel",
+    "15": "Utilities"
+  },
+  "initializer_range": 0.02,
+  "label2id": {
+    "Education": 0,
+    "Entertainment": 1,
+    "Fees": 2,
+    "Groceries": 3,
+    "Healthcare": 4,
+    "Housing": 5,
+    "Income": 6,
+    "Insurance": 7,
+    "Personal Care": 8,
+    "Restaurants": 9,
+    "Shopping": 10,
+    "Subscription": 11,
+    "Transfer": 12,
+    "Transportation": 13,
+    "Travel": 14,
+    "Utilities": 15
+  },
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "pad_token_id": 0,
+  "problem_type": "single_label_classification",
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": false,
+  "tie_weights_": true,
+  "torch_dtype": "float32",
+  "transformers_version": "4.49.0",
+  "vocab_size": 30522
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c09587c7ed91cd59dfeb05fce16127de5d3f78eaade35f6ec68c513d2a4f15f8
+size 267875632

onnx/model.onnx ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:faeeb15de6d752d2ee75a9ff63209d94768a4c831140976059f5e6803c5f4e23
+size 267975237

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,56 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "DistilBertTokenizer",
+  "unk_token": "[UNK]"
+}

vocab.txt ADDED

The diff for this file is too large to render. See raw diff