Abhishek Dey committed on
Commit
2155b20
·
1 Parent(s): 11f94a4

Initial release: compliance classifier v1 (134M params, 99.2% accuracy)

README.md CHANGED
@@ -1,3 +1,178 @@
1
  ---
2
  license: cc-by-nc-4.0
3
  ---
1
  ---
2
  license: cc-by-nc-4.0
3
+ language:
4
+ - en
5
+ - hi
6
+ tags:
7
+ - bert
8
+ - classifier
9
+ - compliance
10
+ - pii-detection
11
+ - fsi
12
+ - query-routing
13
+ - financial-services
14
+ library_name: pytorch
15
+ pipeline_tag: text-classification
16
+ model-index:
17
+ - name: deycoding.compliance-classifier-in-1-0
18
+   results:
19
+   - task:
20
+       type: text-classification
21
+     metrics:
22
+     - name: Accuracy
23
+       type: accuracy
24
+       value: 99.2
25
  ---
26
+
27
+ # BERT Compliance Classifier Router
28
+
29
+ A 134M-parameter BERT encoder trained from scratch to classify financial services industry (FSI) queries, with PII detection for compliance-aware routing.
30
+
31
+ ## Model Description
32
+
33
+ This model classifies incoming user queries into 4 routing categories for cost-optimized, compliance-aware LLM serving in regulated industries:
34
+
35
+ | Label | Complexity | PII | Routing Action |
36
+ |-------|-----------|-----|----------------|
37
+ | `simple_no_pii` | Low | No | Small model, cross-region allowed |
38
+ | `simple_pii` | Low | Yes | Small model, local only (data residency) |
39
+ | `complex_no_pii` | High | No | Large model, cross-region allowed |
40
+ | `complex_pii` | High | Yes | Large model, local only (data residency) |
41
+
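The table above can be read as a lookup from predicted label to routing decision. A minimal sketch of that lookup, assuming a hypothetical `Route` record and a fail-closed default for unknown labels (neither is part of the released files):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Hypothetical routing decision derived from the label table."""
    model_tier: str        # "small" or "large"
    cross_region_ok: bool  # False means the query must stay in-region


# One entry per row of the label table.
ROUTES = {
    "simple_no_pii": Route("small", True),
    "simple_pii": Route("small", False),
    "complex_no_pii": Route("large", True),
    "complex_pii": Route("large", False),
}


def route_for(label: str) -> Route:
    # Assumption: an unrecognized label falls back to the most
    # restrictive route (large model, local only).
    return ROUTES.get(label, Route("large", False))
```

Falling back to the local-only route is a design choice for this sketch: a labeling error then costs money rather than violating data residency.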
42
+ ## Key Results
43
+
44
+ - **Accuracy:** 99.2%
45
+ - **PII Recall:** ~100%
46
+ - **Latency:** ~7ms (GPU) / ~72ms (CPU)
47
+ - **Throughput:** ~130 queries/sec per GPU
48
+ - **Model Size:** 134M parameters / ~530 MB
49
+
50
+ ## Files
51
+
52
+ | File | Description |
53
+ |------|-------------|
54
+ | `deycoding.compliance-classifier-in-1-0.pt` | Model weights (PyTorch state_dict) |
55
+ | `deycoding.compliance-classifier-in-1-0.json` | BPE Tokenizer (32K vocab) |
56
+
57
+ ## Architecture
58
+
59
+ - **Type:** BERT Encoder (bidirectional transformer, no causal mask)
60
+ - **Dimensions:** 768
61
+ - **Layers:** 12
62
+ - **Attention Heads:** 12
63
+ - **FFN Dimension:** 3072
64
+ - **Max Sequence Length:** 128 tokens (inference) / 512 tokens (pre-training)
65
+ - **Vocabulary:** 32,000 (BPE, includes `<mask>` token)
66
+ - **Activation:** GELU
67
+ - **Normalization:** LayerNorm
68
+ - **Classification Head:** Linear(768→768) → Tanh → Dropout → Linear(768→4)
69
+
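As a sanity check on the architecture figures, the parameter count can be estimated from the listed dimensions. This is a rough sketch under stated assumptions that the card does not confirm: learned position embeddings for 512 positions, biases on every linear layer, two LayerNorms per layer plus one after the embeddings, no token-type embeddings, and no MLM head.

```python
def estimate_params(vocab=32_000, d_model=768, n_layers=12,
                    d_ffn=3072, max_pos=512, n_classes=4):
    """Estimate parameter counts for a BERT-style encoder plus classifier head."""
    # Token embeddings + learned position embeddings + embedding LayerNorm.
    embeddings = vocab * d_model + max_pos * d_model + 2 * d_model
    # Q, K, V and output projections, each with a bias.
    attention = 4 * (d_model * d_model + d_model)
    # Two feed-forward projections with biases.
    ffn = (d_model * d_ffn + d_ffn) + (d_ffn * d_model + d_model)
    # Two LayerNorms per layer (scale and shift each).
    norms = 2 * 2 * d_model
    encoder = n_layers * (attention + ffn + norms)
    # Linear(768->768) followed by Linear(768->4), both with biases.
    head = (d_model * d_model + d_model) + (d_model * n_classes + n_classes)
    return {"embeddings": embeddings, "encoder": encoder,
            "head": head, "total": embeddings + encoder + head}


counts = estimate_params()
print(f"{counts['total']:,} parameters, "
      f"{counts['total'] * 4 / 1e6:.0f} MB at 4 bytes each")
```

Under these assumptions the estimate is about 110.6M parameters. The gap to the stated 134M may come from components not modeled here, such as a pre-training output layer; the card does not say.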
70
+ ## Training
71
+
72
+ ### Pre-training
73
+ - **Objective:** Masked Language Model (MLM), 15% masking (80/10/10)
74
+ - **Data:** English Wikipedia (2B tokens, 500K steps)
75
+ - **Batch size:** 8, sequence length: 512
76
+ - **LR:** 1e-4 → 1e-5 (cosine schedule, warmup 2000 steps)
77
+ - **Hardware:** NVIDIA L4 (24 GB), ~48 hours
78
+ - **Final Loss:** 1.815
79
+
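The "15% masking (80/10/10)" objective refers to the standard BERT recipe: 15% of positions are selected as prediction targets, and of those, 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. A minimal sketch in plain Python follows; the `mask_id` value is a placeholder, and unlike a real pipeline it does not exclude special tokens from selection.

```python
import random

IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy loss


def mask_tokens(ids, mask_id, vocab_size, mask_prob=0.15, rng=None):
    """Return (corrupted_inputs, labels) for one sequence of token ids."""
    rng = rng or random.Random()
    inputs = list(ids)
    labels = [IGNORE_INDEX] * len(ids)
    for i, token in enumerate(ids):
        if rng.random() >= mask_prob:
            continue  # position not selected; no loss is computed here
        labels[i] = token  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels
```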
80
+ ### Fine-tuning
81
+ - **Data:** 50,000 synthetic FSI examples (balanced, 12,500 per class)
82
+ - **PII Types:** 14, including PAN, Aadhaar, phone, email, UPI, account number, DOB, card, DL, voter ID, passport, address, and IFSC
83
+ - **Input Formats:** Structured + unstructured (human-typed messy input)
84
+ - **Languages:** English + Hinglish (15%)
85
+ - **Steps:** 8,000, batch=32, LR=2e-5 → 2e-6
86
+ - **Hardware:** NVIDIA L4, ~15 minutes
87
+ - **Final Accuracy:** 99.2%
88
+
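Both training stages report a learning rate that decays by a factor of ten, and the pre-training list names a cosine schedule with 2,000 warmup steps. The sketch below shows one common form of that schedule (linear warmup, then cosine decay to a floor). The exact formula used in training is not given, and applying the same shape to fine-tuning is an assumption.

```python
import math


def lr_at(step, total_steps, peak_lr, min_lr, warmup_steps):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Pre-training values from the list above: 1e-4 -> 1e-5 over 500K steps.
for step in (0, 1_999, 250_000, 500_000):
    print(step, lr_at(step, 500_000, 1e-4, 1e-5, 2_000))
```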
89
+ ## Usage
90
+
91
+ ```python
92
+ import torch
93
+ import torch.nn.functional as F
94
+ from tokenizers import Tokenizer
95
+
96
+ # Load tokenizer
97
+ tokenizer = Tokenizer.from_file("deycoding.compliance-classifier-in-1-0.json")
98
+
99
+ # Load weights; `model` must already be an instance of the architecture class (not defined here; see repository)
100
+ model.load_state_dict(torch.load("deycoding.compliance-classifier-in-1-0.pt", map_location="cpu"))
101
+ model.eval()
102
+
103
+ # Classify a query
104
+ text = "Check balance for Amit Patel account 4532-8876-1234"
105
+ ids = tokenizer.encode(text).ids[:128]
106
+ pad_len = 128 - len(ids)
107
+ input_ids = torch.tensor([ids + [0] * pad_len])
108
+ attn_mask = torch.tensor([[1] * len(ids) + [0] * pad_len])
109
+
110
+ with torch.no_grad():
111
+     probs = F.softmax(model(input_ids, pad_mask=attn_mask), dim=-1)
112
+
113
+ labels = ["simple_no_pii", "simple_pii", "complex_no_pii", "complex_pii"]
114
+ prediction = labels[probs.argmax().item()]
115
+ confidence = probs.max().item() * 100
116
+ print(f"{prediction} ({confidence:.1f}%)")
117
+ # Output: simple_pii (100.0%)
118
+ ```
119
+
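The snippet above truncates and pads a single query by hand. For several queries at once, the same logic can be wrapped in a helper. This sketch uses plain lists so that it runs without the model; `pad_batch` is a hypothetical name, and the pad id of 0 is taken from the snippet rather than verified against the tokenizer file.

```python
def pad_batch(id_lists, max_len=128, pad_id=0):
    """Truncate and right-pad token id lists; return (input_ids, attention_mask)."""
    input_ids, attn_mask = [], []
    for ids in id_lists:
        ids = list(ids)[:max_len]        # truncate long queries
        pad_len = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad_len)
        attn_mask.append([1] * len(ids) + [0] * pad_len)  # 1 = real token
    return input_ids, attn_mask


# Example with made-up token ids and a short max_len for readability.
batch_ids, batch_mask = pad_batch([[5, 6, 7], [9] * 20], max_len=8)
print(batch_ids[0], batch_mask[0])
```

Each returned list can be passed to `torch.tensor(...)` and fed to the model in the same way as `input_ids` and `attn_mask` in the snippet.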
120
+ ## PII Detection Capabilities
121
+
122
+ Detects personally identifiable information (PII) in both structured and unstructured (human-typed) formats:
123
+
124
+ | PII Type | Structured | Unstructured |
125
+ |----------|-----------|--------------|
126
+ | PAN | ABCDE1234F | pan abcde1234f |
127
+ | Aadhaar | 1234 5678 9012 | aadhar no 123456789012 |
128
+ | Phone | +91-98765-43210 | my number is 9876543210 |
129
+ | Email | name@gmail.com | name at gmail dot com |
130
+ | UPI | name@oksbi | my upi is 9876@paytm |
131
+ | Account | 1234-5678-9012 | a/c 12345678 |
132
+ | Card | XXXX-XXXX-XXXX-1234 | card ending 1234 |
133
+ | DOB | 15/03/1990 | born on 15 march 1990 |
134
+ | DL | MH-0120190012345 | dl number mh01 2019 0012345 |
135
+ | Passport | J1234567 | passport J1234567 |
136
+ | Voter ID | ABC1234567 | voter id ABC1234567 |
137
+ | Address | Flat 4B, Tower 2, Koramangala | flat 4b tower 2 koramangala bangalore 560034 |
138
+ | IFSC | SBIN0123456 | ifsc SBIN0123456 |
139
+
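Several of the structured formats in the table have fixed shapes and can be written as patterns, for example to build test fixtures or a secondary check. The sketch below is not how the model detects PII (detection is learned); the patterns cover only four types and are illustrative rather than exhaustive.

```python
import re

# Illustrative patterns for a few structured formats from the table.
PATTERNS = {
    "pan": re.compile(r"\b[A-Za-z]{5}\d{4}[A-Za-z]\b"),       # ABCDE1234F
    "aadhaar": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),      # 1234 5678 9012
    "ifsc": re.compile(r"\b[A-Za-z]{4}0[A-Za-z0-9]{6}\b"),    # SBIN0123456
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),     # name@gmail.com
}


def find_pii(text):
    """Return the sorted names of all patterns that match the text."""
    return sorted(name for name, pattern in PATTERNS.items()
                  if pattern.search(text))


print(find_pii("pan abcde1234f"))         # matches the PAN pattern
print(find_pii("name at gmail dot com"))  # no pattern matches
```

The second call illustrates why a learned detector is useful: the spelled-out email from the table's unstructured column has no `@` for a pattern to anchor on.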
140
+ ## Intended Use
141
+
142
+ - Query routing in multi-tier LLM serving architectures
143
+ - PII detection for data residency compliance (GDPR, RBI, DPDP Act)
144
+ - Cost optimization: route simple queries to cheaper models (65-73% savings)
145
+ - Financial services, healthcare, legal, and other regulated industries
146
+
147
+ ## Limitations
148
+
149
+ - Trained on synthetic data; fine-tune on real queries before production use
150
+ - English and Hinglish only; other languages are not covered
151
+ - Maximum of 128 tokens; longer queries are truncated
152
+ - PII detection is learned (not regex-based), so it may miss novel PII formats absent from the training data
153
+
154
+ ## Ethical Considerations
155
+
156
+ - Model makes routing decisions, not content decisions
157
+ - PII detection is conservative (prefers false positive over false negative)
158
+ - Data residency enforcement is architectural: PII queries physically cannot reach cross-region infrastructure
159
+
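The card does not say how the preference for false positives is implemented. One possible post-hoc rule, sketched here as an assumption, is to sum the probability mass of the two PII classes and treat the query as PII whenever that sum clears a low threshold, instead of taking a plain argmax. The threshold value is hypothetical, and the label order follows the usage snippet.

```python
LABELS = ["simple_no_pii", "simple_pii", "complex_no_pii", "complex_pii"]
PII_THRESHOLD = 0.2  # hypothetical; a lower value is more conservative


def conservative_label(probs, pii_threshold=PII_THRESHOLD):
    """Pick a label from 4 class probabilities, biased toward flagging PII."""
    p_pii = probs[1] + probs[3]      # simple_pii + complex_pii
    p_complex = probs[2] + probs[3]  # complex_no_pii + complex_pii
    complexity = "complex" if p_complex >= 0.5 else "simple"
    pii = "pii" if p_pii >= pii_threshold else "no_pii"
    return f"{complexity}_{pii}"


# Argmax would pick simple_no_pii here; the rule flags PII instead.
print(conservative_label([0.70, 0.25, 0.04, 0.01]))
```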
160
+ ## Citation
161
+
162
+ ```bibtex
163
+ @misc{dey2026classifier,
164
+ title={Classifier-Gated Multi-Tier LLM Routing for Cost-Optimized Serving in Regulated Industries},
165
+ author={Abhishek Dey},
166
+ year={2026},
167
+ url={https://huggingface.co/deycoding/bert-compliance-classifier-router}
168
+ }
169
+ ```
170
+
171
+ ## Author
172
+
173
+ **Abhishek Dey**
174
+ - HuggingFace: [deycoding](https://huggingface.co/deycoding)
175
+
176
+ ## License
177
+
178
+ CC-BY-NC-4.0: non-commercial use permitted with attribution. Commercial licensing is available upon request; contact the author for commercial inquiries.
deycoding.compliance-classifier-in-1-0.json ADDED
The diff for this file is too large to render. See raw diff
 
deycoding.compliance-classifier-in-1-0.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4b8d98aff65ef2cdf39b8476c53341ed7308a53bfaa08289874f1e81ddd6554e
3
+ size 441368485
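
The Git LFS pointer above records the SHA-256 digest and byte size of the weights file, so a downloaded copy can be checked against it. A minimal sketch using only the standard library; the function name and the local path in the usage line are placeholders.

```python
import hashlib
import os

# Values copied from the LFS pointer above.
EXPECTED_SHA256 = "4b8d98aff65ef2cdf39b8476c53341ed7308a53bfaa08289874f1e81ddd6554e"
EXPECTED_SIZE = 441368485


def matches_pointer(path, sha256=EXPECTED_SHA256, size=EXPECTED_SIZE,
                    chunk_size=1 << 20):
    """Return True if the file at `path` has the expected size and digest."""
    if os.path.getsize(path) != size:
        return False  # cheap check first; skips hashing on a size mismatch
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so the whole file never sits in memory at once.
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest() == sha256
```

Typical use would be `matches_pointer("deycoding.compliance-classifier-in-1-0.pt")` before calling `torch.load`.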